Decompose tensordot function to avoid memory #821

Closed
mrocklin opened this issue Nov 7, 2015 · 6 comments

mrocklin commented Nov 7, 2015

Consider a matrix-vector multiply z = W.dot(v), where W is a short-and-fat matrix and v is a tall vector. Under common chunking approaches this has a single output chunk:

     v
     v    
     v
WWWW z

To compute this efficiently in small memory we want to first compute the three per-block products W1.dot(v1), W2.dot(v2), W3.dot(v3), allow their inputs to be released, and then sum these small intermediate results into the final result.
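
As a minimal NumPy sketch of that evaluation order (the block sizes here are illustrative):

```python
import numpy as np

W = np.random.random((4, 12))   # short and fat
v = np.random.random(12)        # tall vector

W1, W2, W3 = np.hsplit(W, 3)    # column blocks of W
v1, v2, v3 = np.split(v, 3)     # row blocks of v

# Each per-block product is small; once computed, its inputs can
# be released.
partials = [W1.dot(v1), W2.dot(v2), W3.dot(v3)]

# Summing the small intermediates produces the single output chunk.
z = sum(partials)
assert np.allclose(z, W.dot(v))
```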

Unfortunately this is not what the current tensordot function does. It provides the two iterators [W1, W2, W3] and [v1, v2, v3] to a helper function, many, which performs both the per-block tensordots and the final sum inside a single task. Because of this choice, all of W and all of v must be in memory at once.
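
For contrast, here is a rough sketch of the two graph shapes in dask's dict-of-tasks format. The dot_and_sum helper is a stand-in for many, and the data keys W1..v3 are assumed to be defined elsewhere in the graph:

```python
import numpy as np

def dot_and_sum(ws, vs):
    # stand-in for many: per-block dots plus the sum, all in one task
    return sum(np.dot(w, x) for w, x in zip(ws, vs))

# Current shape: a single task references every block of W and v,
# so all six blocks must be held in memory together.
current = {'z': (dot_and_sum, ['W1', 'W2', 'W3'], ['v1', 'v2', 'v3'])}

# Desired shape: each per-block dot is an independent task, so the
# scheduler can release W_i and v_i as soon as p_i is finished.
desired = {'p1': (np.dot, 'W1', 'v1'),
           'p2': (np.dot, 'W2', 'v2'),
           'p3': (np.dot, 'W3', 'v3'),
           'z':  (sum, ['p1', 'p2', 'p3'])}
```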

So we should change this behavior. The only issue is that the full scope of tensordot is rather large, and it depends heavily on atop, which, while beautiful and a saver of untold amounts of frustration, isn't something we can easily tweak.

mrocklin commented Nov 8, 2015

@shoyer @clarkfitzg solutions to this problem may come close to your desire for more sophisticated map+atop operations. If you have time to think about that application and what a reasonable feature set might look like, then now would be a good time to write up an issue for it (if there isn't one already).

mrocklin commented Nov 8, 2015

Looking over use cases of atop, it may be reasonable to combine the current implementations of atop and many.

Currently atop takes a single function and gives that function blocks, or iterators of blocks, for each free index in the output (iterators arise when we have dummy indices).

So for x_ij + y_jk, atop calls the input function n_i * n_k times, each time on iterators of length n_j.

One alternative proposal is to provide atop with two functions: a mapping function to be called for each index combination (both free and dummy indices), and a second function to reduce over the dummy indices.

So for x_ij + y_jk we would call the mapper function n_i * n_j * n_k times, each time on a single block per input (two inputs). We would then call a second reduction function n_i * n_k times on those results, much as we currently call atop.
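
A pure-Python sketch of those proposed semantics, where mapreduce_atop is a hypothetical name rather than actual dask API:

```python
import numpy as np

def mapreduce_atop(mapper, reducer, x, y):
    # x is a grid of blocks indexed x[i][j], y is indexed y[j][k];
    # j is the dummy index.
    n_i, n_j, n_k = len(x), len(y), len(y[0])
    out = [[None] * n_k for _ in range(n_i)]
    for i in range(n_i):
        for k in range(n_k):
            # mapper runs n_i * n_j * n_k times, one block per input
            mapped = [mapper(x[i][j], y[j][k]) for j in range(n_j)]
            # reducer runs n_i * n_k times, once per output block
            out[i][k] = reducer(mapped)
    return out

# Example: blocked matrix multiply as per-block dots plus a sum.
x = [[np.ones((2, 2)) for _ in range(3)] for _ in range(2)]  # x_ij
y = [[np.ones((2, 2)) for _ in range(4)] for _ in range(3)]  # y_jk
out = mapreduce_atop(np.dot, sum, x, y)
assert np.allclose(out[0][0], 6.0)  # three 2x2 dots of ones: 3 * 2
```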

mrocklin commented Nov 8, 2015

On second thought, I can achieve this now with multiple calls to atop.
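
A sketch of that two-call idea, written against da.blockwise (the later name for atop); the indices and the partial_dot helper are illustrative, not dask's actual tensordot internals:

```python
import numpy as np
import dask.array as da

W = da.ones((4, 12), chunks=(4, 4))   # short and fat
v = da.ones(12, chunks=4)             # tall vector

def partial_dot(w, x):
    # per-block dot, keeping the contracted index as a length-1 axis
    return np.dot(w, x)[:, None]

# First call: the dummy index j stays in the output, so each
# W_i.dot(v_i) becomes an independent task.
partials = da.blockwise(partial_dot, 'ij', W, 'ij', v, 'j',
                        adjust_chunks={'j': 1}, dtype=W.dtype)

# Second step: sum over the kept index; the scheduler can now free
# each block of W and v as soon as its partial product exists.
z = partials.sum(axis=1)
assert np.allclose(z.compute(), 12.0)
```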

mrocklin added a commit to mrocklin/dask that referenced this issue Nov 8, 2015
Previously the main operation used to compute da.tensordot computed
the chunkwise tensordots and summed them together in a single in-memory
operation.  This possibly resulted in memory blowup, particularly in
short-and-fat .dot. tall-and-skinny cases.

Now we treat each sub-tensordot call as an independent task.  This
allows the scheduler to clear out intermediate results more
intelligently.  We expose more of the algorithm to the scheduler.

Fixes dask#821
shoyer commented Nov 9, 2015

@mrocklin My thought for an automatic wrapper of atop/top in xray is for strict map-type operations, where the reduce step just raises an error if there are multiple blocks and otherwise unpacks the single block. So I don't think we directly need this.

That said, I've found the current way atop works a little confusing, precisely because it conflates the map and reduce steps into a single function. So I do like the idea of changing these functions to separate the map and reduce steps. Exposing more of the task graph to dask makes this even more appealing, and it will be useful for other uses of atop (e.g., einsum #732).

mrocklin commented Nov 9, 2015

I would argue that atop does not conflate a map and a reduce step into one, but rather that it is a generalization that happens to include both as special cases.

The solution here in tensordot was to separate things exactly as you suggest. This was done just by fiddling with indices and calling atop twice, as in the sketch above. I'm fairly confident that you can do what you want with atop, though admittedly with a high learning curve.

If you have an explicit API and set of test cases then I can probably just bang out the function that you need.

clarkfitzg commented
Thanks @mrocklin! I'm definitely interested in doing more with this, but I first need to invest the time to understand what's currently possible before I can provide more useful feedback. That's going to have to wait until my schedule clears up, hopefully this spring.

sinhrks added this to the 0.7.6 milestone Jan 7, 2016