Dask.array dtype tracking to match numpy #64

Closed
mrocklin opened this issue Mar 11, 2015 · 4 comments · Fixed by #87

Comments

@mrocklin
Member

Dask.arrays don't currently track dtype metadata. The dtype property actually has to compute a small piece of the array, which can be expensive for some computations. Perhaps we should add a new field to the dask.array.core.Array class and track dtype information through all operations. This would require us to replicate numpy's dtype coercion/promotion rules precisely, which might get tricky.

At first glance this doesn't seem trivial.
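
For a sense of why: numpy already exposes its promotion rules through np.promote_types and np.result_type, and some of the corners are easy to get wrong by hand. A tracking implementation could lean on these functions rather than re-deriving the tables:

```python
import numpy as np

# Promotion rules dask would need to mirror; note the non-obvious cases:
print(np.promote_types(np.int32, np.float32))   # float64, not float32
print(np.promote_types(np.int8, np.uint8))      # int16
print(np.result_type(np.int64, np.complex64))   # complex128
```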

mrocklin changed the title from "Dask.array dtypes" to "Dask.array dtype tracking to match numpy" on Mar 11, 2015
@shoyer
Member

shoyer commented Mar 24, 2015

From playing around with dask.array for a day, avoiding dtype checks can be a bit tricky -- and it's very noticeable when one happens, because of the ~100ms overhead for starting up the asynchronous task scheduler. For example, I need to check for floating point data whenever I do an array reduction, to know whether I should use mean or nanmean. If I want to take the per-variable mean of a netCDF file with 10 variables, that adds up to 1s of overhead, even on very small data.

This ends up being a little surprising because I'm used to checking the dtype being essentially free.
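
For concreteness, a minimal sketch of that dispatch (the helper name is hypothetical); on a dask array without tracked dtypes, the x.dtype access alone is what pays the scheduler startup cost:

```python
import numpy as np

def reduce_mean(x):
    # use the nan-aware reduction only for floating point data;
    # this dtype check is cheap on a numpy array but currently
    # triggers a small computation on a dask array
    if np.issubdtype(x.dtype, np.floating):
        return np.nanmean(x)
    return np.mean(x)
```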

Another example: suppose I want to save those ten arrays back to disk. I'll need to check the dtype for each so I can create on-disk arrays with the right types. In principle, we could add a special method to calculate all those dtypes in one go (we'll need that for saving the data itself, eventually), but it's awkward.

Short of replicating all numpy dtype rules in dask, you might consider:

  1. Adding some sort of heuristics to the task scheduler to decide whether or not to boot up the whole thing. For example, perhaps dtype checks should never be multithreaded? Or the scheduler could stay single-threaded if all the arrays contained in the graph are small enough? Obviously, this could easily go wrong...
  2. Making scalar proxy arrays to do the dtype inference in operations where the promotion rules might be hard to get right (e.g. for arithmetic operations or reductions); see the sketch after this list.
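
A sketch of option 2, using one-element proxy arrays so that numpy's own promotion machinery produces the result dtype (infer_dtype is a hypothetical helper, not a dask API):

```python
import numpy as np

def infer_dtype(op, *dtypes):
    # run the operation on one-element proxies; no real data is touched
    proxies = [np.ones(1, dtype=dt) for dt in dtypes]
    return op(*proxies).dtype

infer_dtype(np.add, np.int32, np.float32)   # dtype('float64')
infer_dtype(np.mean, np.int64)              # dtype('float64')
```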

Even lower-hanging fruit would be caching the dtype, or even the entire result of calling .compute() on a dask array -- that task scheduler is expensive to get going!

@mrocklin
Member Author

Hrm, we could short-circuit the threaded scheduler in the case where there is a single linear chain of tasks (this is cheap to detect). That would make dtype-by-computation checks fast in the cases where they can be fast.
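
A rough sketch of that detection on a raw dask graph (a dict mapping keys to task tuples); the helper below is an assumed, flat stand-in for dask's real dependency extraction, and nested tasks are ignored for brevity:

```python
from collections import Counter

def task_dependencies(dsk, task):
    # keys of the graph referenced by a task tuple
    if isinstance(task, tuple) and task and callable(task[0]):
        return {arg for arg in task[1:]
                if isinstance(arg, (str, tuple)) and arg in dsk}
    return set()

def is_linear_chain(dsk):
    deps = {k: task_dependencies(dsk, v) for k, v in dsk.items()}
    if any(len(d) > 1 for d in deps.values()):
        return False                    # some task has two inputs
    counts = Counter(d for ds in deps.values() for d in ds)
    return all(c <= 1 for c in counts.values())  # no key feeds two tasks
```

If the graph passes, the tasks can simply be run in order in the calling thread, skipping scheduler startup entirely.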

I wonder how hard it would be to carry around dtype information. I suspect that most operations are relatively straightforward, and tricks like the one you suggest in point 2 might help in the hard cases. Blaze does this already (and wrapping dask with Blaze is another, slightly heavier-weight solution), and hooking into numpy's promotion rules wasn't too hard.

If we do decide to replicate numpy dtype rules then this might be an interesting task for @cowlicks.

BTW, computing many things at once is already available. See #75
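
For reference, that entry point looks roughly like this (in current dask it is spelled dask.compute; the arrays here are just placeholders):

```python
import dask
import dask.array as da

x = da.ones((1000, 1000), chunks=(100, 100))
# one scheduler startup for both results, instead of one each
total, average = dask.compute(x.sum(), x.mean())
```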

@shoyer
Member

shoyer commented Mar 24, 2015

I did see #75, but it won't work (directly) for the dtype attribute.

Another option (at least for this immediate issue) would be to put reduction operations with a skipna option in dask itself. But I do think carrying around dtype information is probably the best solution in the long run.

@mrocklin
Member Author

I'll take a crack at adding dtype information today. I'll probably organize it so that we can add dtype support incrementally and fall back gracefully to the current compute-based approach. This will hopefully get us past the important use cases without having to implement it everywhere.

mrocklin added a commit to mrocklin/dask that referenced this issue Mar 24, 2015
The Array class now holds a `_dtype` attribute.  Various dask.array functions
propagate dtype information, repeating a bit of numpy logic where necessary.

If this logic fails then we fall back on computation of a small element of the
dask array.

Fixes dask#64
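
A minimal, standalone sketch of that fallback pattern (the class and constructor here are hypothetical; only the `_dtype` attribute name comes from the commit message):

```python
import numpy as np

class TrackedArray:
    def __init__(self, compute_one_element, dtype=None):
        # dtype is stored when an operation knew how to propagate it,
        # and left as None when the propagation logic doesn't exist yet
        self._dtype = np.dtype(dtype) if dtype is not None else None
        self._compute_one_element = compute_one_element

    @property
    def dtype(self):
        if self._dtype is None:
            # graceful fallback: compute a single element and cache
            self._dtype = np.asarray(self._compute_one_element()).dtype
        return self._dtype
```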