Re-using tensor storage when possible #664

Merged: 22 commits merged into main from phantom-tensor on Apr 5, 2023

Conversation

coreylowman (Owner) commented Apr 5, 2023

For operations that are made up of many sub-operations, like batchnorm, we currently allocate new data for each sub-operation. Notably, this doesn't take advantage of Rust's ownership model. For instance, calling `t.add(0.1).sqrt()` can modify `t`'s buffer in place, since the call chain takes ownership of `t`. The backward call for scalar addition doesn't even need to keep a reference to `t`, and `sqrt` can keep a reference to its output instead of its input.
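
As a rough illustration of the ownership point (the types and method signatures here are simplified stand-ins, not dfdx's actual API), an op that takes its tensor by value can check whether it is the sole owner of the allocation and, if so, overwrite it instead of allocating:

```rust
use std::sync::Arc;

// Hypothetical, simplified tensor; the real dfdx types differ.
struct Tensor {
    data: Arc<Vec<f32>>,
}

impl Tensor {
    // Takes `self` by value, so the buffer can be reused when nothing
    // else (e.g. the gradient tape) still holds a reference to it.
    fn add(mut self, scalar: f32) -> Tensor {
        if let Some(buf) = Arc::get_mut(&mut self.data) {
            // Sole owner: mutate in place, no allocation.
            buf.iter_mut().for_each(|x| *x += scalar);
        } else {
            // Shared elsewhere: fall back to copy-on-write.
            let buf: Vec<f32> = self.data.iter().map(|x| x + scalar).collect();
            self.data = Arc::new(buf);
        }
        self
    }
}
```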

Summary

  1. Adds GhostTensor, which holds id/length/shape/strides but, importantly, NOT a reference to the data (sketched after this list).
  2. Changes Gradients to use GhostTensor for all internal methods.
  3. Adds multiple const bools to unary/binary ops.
  4. Updates try_unary_op and try_binary_op to handle the new const bools.
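
A minimal sketch of the GhostTensor idea, assuming illustrative field names and a map-based Gradients (not the actual dfdx definitions): gradient bookkeeping only needs metadata keyed by tensor id, so no reference to the data has to be held.

```rust
use std::collections::BTreeMap;

// Illustrative only: carries everything gradient bookkeeping needs
// without keeping the tensor's data alive.
#[derive(Clone, Debug)]
struct GhostTensor {
    id: u64,             // unique tensor id, used as the gradient key
    len: usize,          // number of elements in the allocation
    shape: Vec<usize>,   // logical shape
    strides: Vec<usize>, // strides (note: no reference to the data)
}

#[derive(Default)]
struct Gradients {
    grads: BTreeMap<u64, Vec<f32>>,
}

impl Gradients {
    // Allocates (or fetches) a gradient buffer from metadata alone.
    fn get_or_alloc_mut(&mut self, ghost: &GhostTensor) -> &mut Vec<f32> {
        self.grads
            .entry(ghost.id)
            .or_insert_with(|| vec![0.0; ghost.len])
    }
}
```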

Unary Operations

In general there are three cases for unary operations:

  1. Unary operation has a constant derivative - the scalar operations
  2. Unary operation has a derivative that can use f(x) - exp, sigmoid, sqrt, tanh
  3. Unary operation requires input data

For case 1, we can re-use the input data and don't need to keep a reference to the output. For case 2, we can re-use the input buffer and keep a reference to the output. Case 3 is the current behavior: we allocate a new output and keep a reference to the input.
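
A hedged sketch of how const bools might drive these cases (the flag names and signature are illustrative, not dfdx's actual try_unary_op): the input buffer is reused when the derivative doesn't need x, and only what backward needs is kept alive.

```rust
use std::sync::Arc;

// Illustrative const-bool dispatch for the three cases above.
// DF_USES_X:  derivative needs the input x     (case 3)
// DF_USES_FX: derivative needs the output f(x) (case 2)
fn unary_op<const DF_USES_X: bool, const DF_USES_FX: bool>(
    mut inp: Arc<Vec<f32>>,
    f: impl Fn(f32) -> f32,
) -> (Arc<Vec<f32>>, Option<Arc<Vec<f32>>>, Option<Arc<Vec<f32>>>) {
    let out = if !DF_USES_X {
        // Cases 1 & 2: input is not needed for backward, so try to
        // overwrite its buffer if we are the only owner.
        if let Some(buf) = Arc::get_mut(&mut inp) {
            buf.iter_mut().for_each(|x| *x = f(*x));
            inp.clone() // same allocation, now holding f(x)
        } else {
            Arc::new(inp.iter().map(|&x| f(x)).collect())
        }
    } else {
        // Case 3: backward needs x, so allocate a fresh output.
        Arc::new(inp.iter().map(|&x| f(x)).collect())
    };
    // Keep alive only what the backward pass will actually read.
    let saved_x = if DF_USES_X { Some(inp) } else { None };
    let saved_fx = if DF_USES_FX { Some(out.clone()) } else { None };
    (out, saved_x, saved_fx)
}
```

Under this sketch, a case-2 op like sqrt would be instantiated with DF_USES_X = false and DF_USES_FX = true, while a scalar op (case 1) would set both to false.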

Binary Operations

There are only two cases for binary operations:

  1. It has a constant derivative - add & sub
  2. It requires input data to calculate derivatives - everything else

For case 1, we can re-use the input buffers and don't need to keep a reference to the output. This saves a lot because add and sub are very common operations. We do have to be careful about broadcasts, so we can only reuse an input buffer if it's contiguous.

There is also a heuristic of preferring an input buffer with only one reference, since we can pick between the lhs and the rhs.
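
A sketch of that heuristic under the same assumptions (illustrative names; the real dfdx code differs): reuse is only attempted for a contiguous operand whose allocation has a single owner.

```rust
use std::sync::Arc;

enum Reuse {
    Lhs,
    Rhs,
    Allocate,
}

// Prefer whichever operand can actually be overwritten with the output:
// it must be contiguous (broadcasts rule this out) and uniquely owned.
fn pick_output_buffer(
    lhs: &Arc<Vec<f32>>,
    lhs_contiguous: bool,
    rhs: &Arc<Vec<f32>>,
    rhs_contiguous: bool,
) -> Reuse {
    if lhs_contiguous && Arc::strong_count(lhs) == 1 {
        Reuse::Lhs
    } else if rhs_contiguous && Arc::strong_count(rhs) == 1 {
        Reuse::Rhs
    } else {
        Reuse::Allocate
    }
}
```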

Results

Here are some current results for batchnorm2d & softmax, on my dev CPU and an A10 GPU.

CPU

cargo bench --bench batchnorm2d:

| branch | fwd   | bwd   |
|--------|-------|-------|
| main   | 400ms | 494ms |
| this   | 270ms | 400ms |

cargo bench --bench softmax:

| branch | fwd   | bwd   |
|--------|-------|-------|
| main   | 763ms | 570ms |
| this   | 590ms | 480ms |

A10 GPU

cargo bench -F cuda --bench batchnorm2d:

| branch | fwd   | bwd  |
|--------|-------|------|
| main   | 4.5ms | 32ms |
| this   | 8ms   | 26ms |

cargo bench -F cuda --bench softmax:

| branch | fwd     | bwd  |
|--------|---------|------|
| main   | 13.75ms | 49ms |
| this   | 7.5ms   | 43ms |

coreylowman changed the title from "[WIP] Re-using tensor storage when possible" to "Re-using tensor storage when possible" on Apr 5, 2023
coreylowman merged commit 3ec1042 into main on Apr 5, 2023
coreylowman deleted the phantom-tensor branch on April 5, 2023 at 16:47