# AutoGrad

In [1]:
import torch

## Overview

In PyTorch, AutoGrad is the mechanism that computes gradients
for backpropagation. To do this, torch records the operations
applied to tensors in a "compute graph". Torch only records
operations on tensors we ask it to track, which we specify
per tensor with the `requires_grad` flag.

The following code creates a tensor for which we track gradients.

In [2]:
x = torch.ones((2, 2), requires_grad=True)
print(f'[+] x:\n{x}')

[+] x:
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)


## torch.autograd.Function

Torch also has Functions. Functions and tensors together make up the
computation graph. Every Tensor, save those allocated by the user,
has a Function that generated it. This appears to be similar to an op in
TensorFlow, though it's worth noting that TensorFlow actually stores gradient
operations in the graph. For more info on that, see
[Ian Goodfellow's Book](https://www.amazon.com/Deep-Learning-NONE-Ian-Goodfellow-ebook/dp/B01MRVFGX4/ref=sr_1_3?keywords=deep+learning&qid=1565836699&s=gateway&sr=8-3).
It's a tough text.

I digress. The function creating the tensor can be accessed through `.grad_fn`.

In [3]:
y = x + 2
print(f'[+] y:\n{y}')

[+] y:
tensor([[3., 3.],
        [3., 3.]], grad_fn=<AddBackward0>)
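
As a small sketch of the difference between user-allocated tensors and tensors produced by an operation, the snippet below inspects `.grad_fn` and `.is_leaf` on both. The variable names are only illustrative.

```python
import torch

# A tensor allocated by the user is a leaf: no Function produced it.
x = torch.ones((2, 2), requires_grad=True)

# A tensor produced by an operation records the Function that created it.
y = x + 2

print(f'[+] x.is_leaf: {x.is_leaf}, x.grad_fn: {x.grad_fn}')
print(f'[+] y.is_leaf: {y.is_leaf}, y.grad_fn: {type(y.grad_fn).__name__}')
```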


By default, allocated tensors don't require gradients. However, if
at least one input to an operation requires a gradient,
the output will also require gradients. Only if none of the inputs
require it will the output not require it. Backward computation is not
performed on subgraphs that do not require gradients.

In [4]:
x = torch.ones((2, 2))

print(f'[+] x.requires_grad: {x.requires_grad}')

[+] x.requires_grad: False


In [5]:
# x and y do not require gradients, but z does.
x = torch.randn((5, 5))
y = torch.randn((5, 5))
z = torch.randn((5, 5), requires_grad=True)

# a will not require gradients because the inputs do not
a = x + y
print(f'[+] a.requires_grad: {a.requires_grad}')

# b will require gradients because z does.
b = a + z
print(f'[+] b.requires_grad: {b.requires_grad}')

[+] a.requires_grad: False
[+] b.requires_grad: True


## Gradient Computation Algorithm

This will be high level. For more details see the
[autograd mechanics notes](https://pytorch.org/docs/stable/notes/autograd.html?highlight=grad_fn).

AutoGrad is a reverse-mode automatic differentiation system. As we perform
the forward pass, torch records a graph of Function objects that compute the
gradients. The leaves of this graph are the input tensors, and the roots are
the output tensors. Once the forward pass is completed, we can trace this
gradient graph from the roots back to the leaves (applying the chain rule)
to compute the backward pass.
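
To make the recorded graph a bit more concrete, here is a minimal sketch that walks it from the output back toward the inputs using `grad_fn.next_functions`. The helper `walk` is a hypothetical name introduced only for this illustration, and the exact node names printed may vary between torch versions.

```python
import torch

x = torch.ones((2, 2), requires_grad=True)
out = ((x + 2) * 3).mean()


def walk(fn, depth=0, names=None):
    """Collect the names of backward nodes, starting at the output."""
    if names is None:
        names = []
    if fn is None:
        # Inputs that are not tracked (e.g. the constants 2 and 3) show up as None.
        return names
    names.append(type(fn).__name__)
    print('  ' * depth + names[-1])
    # next_functions holds (node, input_index) pairs for this node's inputs.
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1, names)
    return names


names = walk(out.grad_fn)
```

The walk ends at an `AccumulateGrad` node, which is where the gradient for the leaf `x` is written into `x.grad`.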

Interestingly, the gradient graph is rebuilt from scratch at each iteration.
I'm near positive this is not the case in TensorFlow's graph mode. Rebuilding
the graph on every forward pass is what allows for dynamic control flow. But,
it likely comes at a performance cost.
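
Here is a small sketch of what dynamic control flow means in practice. The function `forward` is a made-up example, not part of torch: its loop runs a data-dependent number of times, so the graph recorded for one input can differ from the graph recorded for another.

```python
import torch


def forward(x):
    # The number of doublings depends on the data, so each call can
    # record a different graph.
    y = x
    while y.norm() < 10:
        y = y * 2
    return y.sum()


# Three doublings are needed here, so d(out)/dx = 2**3 = 8 per element.
x1 = torch.ones(3, requires_grad=True)
forward(x1).backward()
print(f'[+] x1.grad: {x1.grad}')

# Only one doubling is needed here, so d(out)/dx = 2 per element.
x2 = torch.full((3,), 4.0, requires_grad=True)
forward(x2).backward()
print(f'[+] x2.grad: {x2.grad}')
```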

## Removing tracking history

There are multiple ways to remove tracking history, or to temporarily
stop tracking. First, calling `.detach()` returns a new tensor that is
cut off from the graph and carries no tracking history. Additionally,
wrapping computations in

```
with torch.no_grad():
    ...
```

prevents history from being tracked inside the context manager.

Last, you can also call `.requires_grad_()` to change the `requires_grad`
flag in place, though this only works on leaf tensors.

In [6]:
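# b is the result of `a + z`, so it is not a leaf. Changing its
# requires_grad flag in place raises a RuntimeError.
try:
    b.requires_grad_(False)
except RuntimeError:
    pass

# detach() returns a tensor without tracking history.
b = b.detach()
print(f'[+] b.requires_grad: {b.requires_grad}')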
    
# c won't require a gradient, even though z does.
with torch.no_grad():
    c = x + z
print(f'[+] c.requires_grad: {c.requires_grad}')

[+] b.requires_grad: False
[+] c.requires_grad: False
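
As a rough sketch of how the three approaches differ, the snippet below applies each of them to fresh tensors. The names `w`, `h`, `d`, and `n` are illustrative only.

```python
import torch

w = torch.ones((2, 2), requires_grad=True)  # leaf tensor
h = w * 2                                   # non-leaf, tracked

# detach() returns a new tensor; the original h is still tracked.
d = h.detach()
print(f'[+] d.requires_grad: {d.requires_grad}, h.requires_grad: {h.requires_grad}')

# The detached tensor shares memory with the original rather than copying it.
print(f'[+] same storage: {d.data_ptr() == h.data_ptr()}')

# Inside no_grad, results are not tracked even though w requires gradients.
with torch.no_grad():
    n = w * 2
print(f'[+] n.requires_grad: {n.requires_grad}')

# Changing the flag in place is allowed here because w is a leaf.
w.requires_grad_(False)
print(f'[+] w.requires_grad: {w.requires_grad}')
```

Because the detached tensor shares memory with the original, in-place changes to one are visible in the other.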


## Gradient Examples

As discussed previously, as we dynamically build (and compute) our
compute graph, torch builds a gradient graph which can be used for
gradient computation by the chain rule. We create a tensor, and 
perform operations on it:

$$y = x + 2$$

$$z = 3 (y \circ y)$$

where $\circ$ is the
[Hadamard Product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)),
i.e. elementwise multiplication, and

$$out = E[z]$$

is the mean over the elements of $z$.

We then perform the backward pass, which populates the `.grad` attribute
of the leaf tensors that require gradients. For example, taking `x.grad` gives us

$$\frac{\partial out}{\partial x}$$

I'm not going to write the general math out here, but I highly encourage
anyone to read the end of the tutorial this notebook is based on.
Under the hood, torch is computing products with the Jacobian matrix
(vector-Jacobian products) rather than building the full matrix. This is
an incredibly important thing to understand if you wish to follow the
mathematics of deep learning.

The Jacobian matrix tells us which linear transformation a nonlinear
transformation looks like in the neighborhood of a point. Because
neural networks apply many nonlinear transformations, this
is an important result.
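
As a hedged sketch of the Jacobian idea, the snippet below builds the full Jacobian of a small elementwise function with `torch.autograd.functional.jacobian`, then checks that `backward` with a `gradient` argument returns the vector-Jacobian product. The function `f` and the vector `v` are made up for illustration.

```python
import torch
from torch.autograd.functional import jacobian


def f(x):
    # Elementwise map from R^3 to R^3, so its Jacobian is diagonal
    # with entries 6 * (x_i + 2).
    return 3 * (x + 2) ** 2


x = torch.tensor([1.0, 2.0, 3.0])
J = jacobian(f, x)  # full 3x3 Jacobian evaluated at x
print(f'[+] J:\n{J}')

# backward(gradient=v) accumulates v^T J into x_req.grad.
x_req = x.clone().requires_grad_(True)
v = torch.tensor([1.0, 0.5, 0.25])
f(x_req).backward(gradient=v)
print(f'[+] matches v @ J: {torch.allclose(x_req.grad, v @ J)}')
```

For a scalar output such as `out` in this notebook, `backward()` is called without arguments, which corresponds to using `v = 1`.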

The last reference is by Christopher Olah. It is likely the
most recommended of this list.

References:

[Khan Academy Jacobian](https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/jacobian/v/the-jacobian-matrix)

[Jacobian Definition](https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant)

[Torch AutoGrad Tutorial](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#sphx-glr-beginner-blitz-autograd-tutorial-py)

[Neural Networks, Manifolds, and Topology](https://colah.github.io/posts/2014-03-NN-Manifolds-Topology/)

In [7]:
x = torch.ones((2, 2), requires_grad=True)
y = x + 2

print(f'[+] y.grad_fn: {y.grad_fn}')

z = y * y * 3
print(f'[+] z:\n{z}')
out = z.mean()
print(f'[+] out:\n{out}')

out.backward()

print(f'[+] x.grad:\n{x.grad}')

[+] y.grad_fn: <AddBackward0 object at 0x7f2804576390>
[+] z:
tensor([[27., 27.],
        [27., 27.]], grad_fn=<MulBackward0>)
[+] out:
27.0
[+] x.grad:
tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]])
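
The value 4.5 can be checked by hand for this small case. With four elements, $out = \frac{1}{4} \sum_i 3 (x_i + 2)^2$, so

$$\frac{\partial out}{\partial x_i} = \frac{3}{2} (x_i + 2)$$

which is $4.5$ at $x_i = 1$. A minimal sketch comparing this hand-derived expression with what autograd returns:

```python
import torch

x = torch.ones((2, 2), requires_grad=True)
y = x + 2
out = (y * y * 3).mean()
out.backward()

# Hand-derived gradient: (3 / 2) * (x + 2), evaluated without tracking.
analytic = 1.5 * (x.detach() + 2)
print(f'[+] autograd matches analytic: {torch.allclose(x.grad, analytic)}')
```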
