Re-using tensor storage when possible #664

Merged: 22 commits merged into main from phantom-tensor on Apr 5, 2023

Conversation

coreylowman (Owner) commented Apr 5, 2023

For operations that are made up of many sub-operations, like batchnorm, we currently allocate new data for each sub-operation. This notably doesn't take advantage of Rust's ownership model. For instance, calling `t.add(0.1).sqrt()` could modify `t`'s buffer in place: the backward pass for scalar addition doesn't even need to keep a reference to `t`, and sqrt can keep a reference to its output instead of its input.
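To make the ownership point concrete, here is a minimal sketch (the helper is hypothetical, not the dfdx API) of how a uniquely-owned buffer can be reused in place, while a buffer that is still shared (e.g. held by the gradient tape) gets cloned:

```rust
use std::sync::Arc;

// Sketch only: reuse the allocation when we are the sole owner of the data.
fn scalar_add(data: Arc<Vec<f32>>, scalar: f32) -> Arc<Vec<f32>> {
    match Arc::try_unwrap(data) {
        // Sole owner: mutate the existing buffer in place.
        Ok(mut buf) => {
            buf.iter_mut().for_each(|x| *x += scalar);
            Arc::new(buf)
        }
        // Someone else still holds the data: fall back to cloning it.
        Err(shared) => {
            let mut buf = (*shared).clone();
            buf.iter_mut().for_each(|x| *x += scalar);
            Arc::new(buf)
        }
    }
}
```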

Summary

  1. Adds GhostTensor, which holds a tensor's id/length/shape/strides but importantly NOT a reference to its data (see the sketch after this list).
  2. Changes Gradients to use GhostTensor for all internal methods.
  3. Adds multiple const bools to unary/binary ops.
  4. Updates try_unary_op and try_binary_op to handle the new const bools.
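As a rough sketch (field names here are assumptions, not the exact dfdx definitions), a GhostTensor carries just enough metadata to identify a tensor's gradient entry without keeping the data alive:

```rust
// Sketch only: the real GhostTensor lives in dfdx and may differ in detail.
struct UniqueId(u64);

struct GhostTensor<S> {
    id: UniqueId,        // identifies this tensor's entry in Gradients
    len: usize,          // number of elements in the underlying buffer
    shape: S,            // shape of the tensor
    strides: Vec<usize>, // strides, used e.g. to check whether a view is contiguous
}
```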

Unary Operations

In general there are three cases for unary operations:

  1. Unary operation has a constant derivative - the scalar operations
  2. Unary operation has a derivative that can use f(x) - exp, sigmoid, sqrt, tanh
  3. Unary operation requires input data

For case 1 we can re-use the input buffer and don't need to keep a reference to either the input or the output. For case 2, we can re-use the input buffer and keep a reference to the output. Case 3 behaves as it does today: we allocate a new output buffer and keep a reference to the input.
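A sketch of how the new const bools can encode these three cases at compile time (trait and constant names here are assumptions, not necessarily the exact dfdx identifiers):

```rust
// Sketch: each unary op declares which data its derivative needs.
trait UnaryDerivative {
    /// Case 1: the derivative is a constant, so backward needs neither input nor output.
    const HAS_CONST_DF: bool;
    /// Case 2: the derivative can be computed from f(x) alone, so backward keeps the output.
    const DF_USES_FX: bool;
    // Case 3 is the fallback when both are false: backward keeps the input.
}

struct AddScalar; // case 1: d/dx (x + c) = 1
impl UnaryDerivative for AddScalar {
    const HAS_CONST_DF: bool = true;
    const DF_USES_FX: bool = false;
}

struct Sqrt; // case 2: d/dx sqrt(x) = 1 / (2 * sqrt(x)) = 1 / (2 * f(x))
impl UnaryDerivative for Sqrt {
    const HAS_CONST_DF: bool = false;
    const DF_USES_FX: bool = true;
}
```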

Binary Operations

There are only two cases for binary operations:

  1. It has a constant derivative - add & sub
  2. It requires input data to calculate derivatives - everything else

For case 1, we can re-use one of the input buffers and don't need to keep a reference to the output. This saves a lot of allocations because add and sub are very common operations. We do have to be careful about broadcasts: an input buffer can only be reused if it is contiguous.

There is also a heuristic for choosing between lhs and rhs: prefer an input buffer with only 1 reference, since that one can be taken over without cloning.
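A sketch of this reuse decision for binary ops (function and enum names are hypothetical): prefer an input that is contiguous and has no other owners, otherwise allocate a fresh output buffer.

```rust
use std::sync::Arc;

enum Reuse {
    Lhs,
    Rhs,
    Allocate,
}

// Sketch only: decide which (if any) input buffer the binary op can take over.
fn pick_output_buffer<T>(
    lhs: &Arc<Vec<T>>,
    lhs_contiguous: bool,
    rhs: &Arc<Vec<T>>,
    rhs_contiguous: bool,
) -> Reuse {
    if lhs_contiguous && Arc::strong_count(lhs) == 1 {
        Reuse::Lhs
    } else if rhs_contiguous && Arc::strong_count(rhs) == 1 {
        Reuse::Rhs
    } else {
        Reuse::Allocate
    }
}
```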

Results

Here are some current results for batchnorm2d & softmax, on my dev CPU and on an A10 GPU.

CPU

cargo bench --bench batchnorm2d:

| branch | fwd   | bwd   |
|--------|-------|-------|
| main   | 400ms | 494ms |
| this   | 270ms | 400ms |

cargo bench --bench softmax:

| branch | fwd   | bwd   |
|--------|-------|-------|
| main   | 763ms | 570ms |
| this   | 590ms | 480ms |

A10 GPU

cargo bench -F cuda --bench batchnorm2d:

| branch | fwd   | bwd  |
|--------|-------|------|
| main   | 4.5ms | 32ms |
| this   | 8ms   | 26ms |

cargo bench -F cuda --bench softmax:

| branch | fwd     | bwd  |
|--------|---------|------|
| main   | 13.75ms | 49ms |
| this   | 7.5ms   | 43ms |

@coreylowman coreylowman changed the title [WIP] Re-using tensor storage when possible Re-using tensor storage when possible Apr 5, 2023
@coreylowman coreylowman merged commit 3ec1042 into main Apr 5, 2023
@coreylowman coreylowman deleted the phantom-tensor branch April 5, 2023 16:47