Decompose tensordot function to avoid memory #821

Closed
mrocklin opened this issue Nov 7, 2015 · 6 comments

mrocklin commented Nov 7, 2015

Consider a matrix-vector multiply z = W.dot(v), where W is a short-and-fat matrix and v is a tall vector. Under common chunking approaches this has a single output chunk:

     v
     v    
     v
WWWW z

To compute this efficiently in small memory we want to first compute the three per-block products W1.dot(v1), W2.dot(v2), W3.dot(v3), allow their inputs to be released, and then sum these small intermediate results into the final result.
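
As a minimal NumPy sketch of that evaluation order (the block sizes here are illustrative):

```python
import numpy as np

W = np.random.random((4, 12))   # short and fat
v = np.random.random(12)        # tall vector

W1, W2, W3 = np.hsplit(W, 3)    # column blocks of W
v1, v2, v3 = np.split(v, 3)     # row blocks of v

# Each per-block product is small; once computed, its inputs can
# be released.
partials = [W1.dot(v1), W2.dot(v2), W3.dot(v3)]

# Summing the small intermediates produces the single output chunk.
z = sum(partials)
assert np.allclose(z, W.dot(v))
```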

Unfortunately this is not what the current tensordot function does. It provides the two iterators [W1, W2, W3] and [v1, v2, v3] to a helper function, many, which performs both the per-block tensordots and the final sum inside a single task. Because of this choice, all of W and all of v must be in memory at once.
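
For contrast, here is a rough sketch of the two graph shapes in dask's dict-of-tasks format. The dot_and_sum helper is a stand-in for many, and the data keys W1..v3 are assumed to be defined elsewhere in the graph:

```python
import numpy as np

def dot_and_sum(ws, vs):
    # stand-in for many: per-block dots plus the sum, all in one task
    return sum(np.dot(w, x) for w, x in zip(ws, vs))

# Current shape: a single task references every block of W and v,
# so all six blocks must be held in memory together.
current = {'z': (dot_and_sum, ['W1', 'W2', 'W3'], ['v1', 'v2', 'v3'])}

# Desired shape: each per-block dot is an independent task, so the
# scheduler can release W_i and v_i as soon as p_i is finished.
desired = {'p1': (np.dot, 'W1', 'v1'),
           'p2': (np.dot, 'W2', 'v2'),
           'p3': (np.dot, 'W3', 'v3'),
           'z':  (sum, ['p1', 'p2', 'p3'])}
```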

So we should change this behavior. The only issue is that the full scope of tensordot is rather large, and it depends heavily on atop, which, while beautiful and a saver of untold amounts of frustration, isn't something we can easily tweak.

mrocklin commented Nov 8, 2015

@shoyer @clarkfitzg solutions to this problem may come close to your desire for more sophisticated map+atop operations. If you have time to think about that application and what a reasonable feature set might look like, then now would be a good time to write up an issue for it (if there isn't one already).

mrocklin commented Nov 8, 2015

Looking over use cases of atop, it may be reasonable to combine the current implementations of atop and many.

Currently atop takes a single function and gives that function blocks, or iterators of blocks, for each free index in the output (iterators arise when we have dummy indices).

So for x_ij + y_jk, atop calls the input function n_i * n_k times, each time on iterators of length n_j.

One alternative proposal is to provide atop with two functions: a mapping function to be called for each index combination (both free and dummy indices), and a second function to reduce over the dummy indices.

So for x_ij + y_jk we would call the mapper function n_i * n_j * n_k times, each time on a single block per input (two inputs). We would then call a second reduction function n_i * n_k times on those results, much as we currently call atop.
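
A pure-Python sketch of those proposed semantics, where mapreduce_atop is a hypothetical name rather than actual dask API:

```python
import numpy as np

def mapreduce_atop(mapper, reducer, x, y):
    # x is a grid of blocks indexed x[i][j], y is indexed y[j][k];
    # j is the dummy index.
    n_i, n_j, n_k = len(x), len(y), len(y[0])
    out = [[None] * n_k for _ in range(n_i)]
    for i in range(n_i):
        for k in range(n_k):
            # mapper runs n_i * n_j * n_k times, one block per input
            mapped = [mapper(x[i][j], y[j][k]) for j in range(n_j)]
            # reducer runs n_i * n_k times, once per output block
            out[i][k] = reducer(mapped)
    return out

# Example: blocked matrix multiply as per-block dots plus a sum.
x = [[np.ones((2, 2)) for _ in range(3)] for _ in range(2)]  # x_ij
y = [[np.ones((2, 2)) for _ in range(4)] for _ in range(3)]  # y_jk
out = mapreduce_atop(np.dot, sum, x, y)
assert np.allclose(out[0][0], 6.0)  # three 2x2 dots of ones: 3 * 2
```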

mrocklin commented Nov 8, 2015

On second thought, I can achieve this now with multiple calls to atop.
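
A sketch of that two-call idea, written against da.blockwise (the later name for atop); the indices and the partial_dot helper are illustrative, not dask's actual tensordot internals:

```python
import numpy as np
import dask.array as da

W = da.ones((4, 12), chunks=(4, 4))   # short and fat
v = da.ones(12, chunks=4)             # tall vector

def partial_dot(w, x):
    # per-block dot, keeping the contracted index as a length-1 axis
    return np.dot(w, x)[:, None]

# First call: the dummy index j stays in the output, so each
# W_i.dot(v_i) becomes an independent task.
partials = da.blockwise(partial_dot, 'ij', W, 'ij', v, 'j',
                        adjust_chunks={'j': 1}, dtype=W.dtype)

# Second step: sum over the kept index; the scheduler can now free
# each block of W and v as soon as its partial product exists.
z = partials.sum(axis=1)
assert np.allclose(z.compute(), 12.0)
```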

mrocklin added a commit to mrocklin/dask that referenced this issue Nov 8, 2015
Previously the main operation used to compute da.tensordot computed
the chunkwise tensordots and summed them together in a single in-memory
operation.  This possibly resulted in memory blowup, particularly in
short-and-fat .dot. tall-and-skinny cases.

Now we treat each sub-tensordot call as an independent task.  This
allows the scheduler to clear out intermediate results more
intelligently.  We expose more of the algorithm to the scheduler.

Fixes dask#821
shoyer commented Nov 9, 2015

@mrocklin My thought for an automatic wrapper of atop/top in xray is for strict map-type operations, where the reduce step just raises an error if there are multiple blocks and otherwise unpacks the single block. So I don't think we directly need this.

That said, I've found the current way atop works a little confusing, precisely because it conflates the map and reduce steps into a single function. So I do like the idea of changing these functions to separate the map and reduce steps. Exposing more of the task graph to dask makes this even more appealing, and it will be useful for other uses of atop (e.g., einsum #732).

mrocklin commented Nov 9, 2015

I would argue that atop does not conflate a map and a reduce step into one, but rather that it is a generalization that happens to include both as special cases.

The solution here in tensordot was to separate things exactly as you suggest. This was done just by fiddling with indices and calling atop twice, as in the sketch above. I'm fairly confident that you can do what you want with atop, though admittedly with a high learning curve.

If you have an explicit API and set of test cases then I can probably just bang out the function that you need.

clarkfitzg commented
Thanks @mrocklin! I'm definitely interested in doing more with this, but I first need to invest the time to understand what's currently possible before I can provide more useful feedback. That's going to have to wait until my schedule clears up, hopefully this spring.

sinhrks added this to the 0.7.6 milestone Jan 7, 2016