# PyTorch Gradients

This section covers the PyTorch <a href='https://pytorch.org/docs/stable/autograd.html'><strong><tt>autograd</tt></strong></a> package, which computes the gradients that gradient descent relies on. Tools include:
* <a href='https://pytorch.org/docs/stable/autograd.html#torch.autograd.backward'><tt><strong>torch.autograd.backward()</strong></tt></a>
* <a href='https://pytorch.org/docs/stable/autograd.html#torch.autograd.grad'><tt><strong>torch.autograd.grad()</strong></tt></a>
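
As a minimal sketch of how these two entry points differ, the cell below differentiates $y = x^2$ (a function chosen here purely for illustration) both ways. `torch.autograd.grad()` returns the gradient, while `torch.autograd.backward()` stores it on the input tensor.

```python
import torch

# Illustrative function (an assumption for this sketch): y = x^2, so dy/dx = 2x
x = torch.tensor(3.0, requires_grad=True)

# torch.autograd.grad returns a tuple of gradients and leaves x.grad untouched
y = x**2
(dy_dx,) = torch.autograd.grad(y, x)
print(dy_dx)    # tensor(6.)
print(x.grad)   # None

# torch.autograd.backward (what y.backward() calls) accumulates into x.grad
y = x**2
torch.autograd.backward(y)
print(x.grad)   # tensor(6.)
```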

## Autograd - Automatic Differentiation

The PyTorch <a href='https://pytorch.org/docs/stable/autograd.html'><strong><tt>autograd</tt></strong></a> package provides automatic differentiation for all operations on Tensors. 

When a Tensor's `.requires_grad` attribute is set to True, autograd starts tracking every operation performed on it.

Once the computation is finished you can call `.backward()` on the result and have all the gradients computed automatically. The gradient for a tensor is accumulated (added, not overwritten) into its `.grad` attribute.
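
Because gradients accumulate, a second backward pass adds to whatever is already stored. The short sketch below (using $x^2$ as an arbitrary example function) shows the effect and one way to clear it.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# First backward pass: d(x^2)/dx = 2x = 4
(x**2).backward()
first = x.grad.clone()
print(first)        # tensor(4.)

# Second backward pass adds to the stored value instead of replacing it
(x**2).backward()
second = x.grad.clone()
print(second)       # tensor(8.)

# Reset the accumulated gradient in place before the next pass
x.grad.zero_()
print(x.grad)       # tensor(0.)
```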
    
Let's see this in practice.

## Back-propagation on one step
We'll start by applying a single polynomial function $y = f(x)$ to tensor $x$. Then we'll backprop and print the gradient $\frac {dy} {dx}$.

$\text{Function:}\quad y = 2x^4 + x^3 + 3x^2 + 5x + 1 \\
\text{Derivative:}\quad y' = 8x^3 + 3x^2 + 6x + 5$

#### Step 1. Perform standard imports

In [1]:
import torch

#### Step 2. Create a tensor with <tt>requires_grad</tt> set to True

In [2]:
x = torch.tensor(2.0, requires_grad=True)
print(x)
print(x.grad) # gradient has not yet been computed

tensor(2., requires_grad=True)
None


#### Step 3. Define a function

In [3]:
y = 2*x**4 + x**3 + 3*x**2 + 5*x + 1

print(y)

tensor(63., grad_fn=<AddBackward0>)


In [4]:
y.grad_fn

<AddBackward0 at 0x15929cac0>

Since $y$ was created as the result of an operation, it has an associated gradient function, accessible as <tt>y.grad_fn</tt>.<br>
The value of $y$ is calculated as:<br>

$\quad y=2(2)^4+(2)^3+3(2)^2+5(2)+1 = 32+8+12+10+1 = 63$

This is the value of $y$ when $x=2$.
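
For contrast, a tensor created directly by the user (a "leaf") has no gradient function. A small sketch, reusing the same polynomial:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = 2*x**4 + x**3 + 3*x**2 + 5*x + 1

# x was created by hand, so it is a leaf with no grad_fn
print(x.is_leaf, x.grad_fn)    # True None

# y came out of an operation, so it is not a leaf and carries a grad_fn
print(y.is_leaf)               # False
print(y.grad_fn is not None)   # True
```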

#### Step 4. Backprop

In [5]:
# perform backpropagation and compute all gradients
y.backward()

#### Step 5. Display the resulting gradient

In [6]:
print(x.grad)

tensor(93.)


Note that <tt>x.grad</tt> is an attribute of tensor $x$, so we don't use parentheses. The computation is the result of<br>

$\quad y'=8(2)^3+3(2)^2+6(2)+5 = 64+12+12+5 = 93$

This is the slope of the polynomial at the point $(2,63)$.
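
The same check can be run at several points at once. The sketch below compares autograd against the hand-derived $y'$. The helper names `f` and `f_prime` and the sample points are assumptions made for this illustration. Summing the outputs before calling `backward()` works here because each output depends only on its own input.

```python
import torch

def f(x):
    return 2*x**4 + x**3 + 3*x**2 + 5*x + 1

def f_prime(x):
    # derivative worked out by hand
    return 8*x**3 + 3*x**2 + 6*x + 5

xs = torch.tensor([-1.0, 0.0, 2.0], requires_grad=True)

# reduce to a scalar so backward() can be called without arguments
f(xs).sum().backward()

print(xs.grad)                 # autograd: slopes at x = -1, 0, 2 (-6, 5, 93)
print(f_prime(xs.detach()))    # hand-derived formula at the same points
```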

## Back-propagation on multiple steps
Now let's do something more complex, involving layers $y$ and $z$ between $x$ and our output layer $out$.

#### 1. Create a tensor

In [7]:
x = torch.tensor([[1.,2,3],[3,2,1]], requires_grad=True)
print(x)

tensor([[1., 2., 3.],
        [3., 2., 1.]], requires_grad=True)


#### 2. Create the first layer with $y = 3x+2$

In [8]:
y = 3*x + 2
print(y)

tensor([[ 5.,  8., 11.],
        [11.,  8.,  5.]], grad_fn=<AddBackward0>)


#### 3. Create the second layer with $z = 2y^2$

In [9]:
z = 2*y**2
print(z)

tensor([[ 50., 128., 242.],
        [242., 128.,  50.]], grad_fn=<MulBackward0>)


#### 4. Set the output to be the matrix mean

In [10]:
out = z.mean()
print(out)

tensor(140., grad_fn=<MeanBackward0>)


#### 5. Now perform back-propagation to find the gradient of out w.r.t. x

In [11]:
out.backward()
print(x.grad)

tensor([[10., 16., 22.],
        [22., 16., 10.]])


You should see a 2x3 matrix. 
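
Calling `.backward()` with no arguments only works because `out` is a scalar. As a rough sketch of what the mean contributes, the same gradient can be obtained by calling `backward()` on the non-scalar `z` and supplying the upstream gradient $\partial o / \partial z_i = 1/6$ explicitly:

```python
import torch

x = torch.tensor([[1., 2, 3], [3, 2, 1]], requires_grad=True)
z = 2 * (3*x + 2)**2

# z is not a scalar, so backward() needs the upstream gradient d(out)/dz.
# For out = z.mean() over 6 elements, that is 1/6 for every element.
z.backward(gradient=torch.full_like(z, 1/6))

# matches the result of out.backward() up to floating-point rounding
print(x.grad)
```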

If we call the final <tt>out</tt> tensor "$o$", we can calculate the partial derivative of $o$ with respect to each element $x_i$. To solve the derivative we use the <a href='https://en.wikipedia.org/wiki/Chain_rule'>chain rule</a>:

$$\frac{\partial o}{\partial x_i} = \frac{\partial o}{\partial z_i} \cdot \frac{\partial z_i}{\partial y_i} \cdot \frac{\partial y_i}{\partial x_i}$$

In this case<br>

$$o = \frac {1} {6}\sum_{i=1}^{6} z_i$$

$$z_i = 2(y_i)^2$$

$$y_i = 3x_i+2$$

Thus <br>

$$
\begin{align} 
    \frac{\partial o}{\partial z_i} &= \frac{1}{6}\\
    \frac{\partial z_i}{\partial y_i} &= 4(y_i) = 4(3x_i + 2)\\
    \frac{\partial y_i}{\partial x_i} &= 3
\end{align}
$$

Thus 

$$
\begin{align}
    \frac{\partial o}{\partial x_i} &= \frac{\partial o}{\partial z_i} \cdot \frac{\partial z_i}{\partial y_i} \cdot \frac{\partial y_i}{\partial x_i}\\
    &= \frac{1}{6} \cdot 4(3x_i + 2) \cdot 3\\
    &= 2(3x_i + 2)
\end{align}
$$

Therefore,<br>

$\frac{\partial o}{\partial x_i} = \frac{1}{6}\times 12(3x_i+2) = 2(3x_i+2)$<br>

$\frac{\partial o}{\partial x_i}\bigr\rvert_{x_i=1} = 2(3(1)+2) = 10$

$\frac{\partial o}{\partial x_i}\bigr\rvert_{x_i=2} = 2(3(2)+2) = 16$

$\frac{\partial o}{\partial x_i}\bigr\rvert_{x_i=3} = 2(3(3)+2) = 22$
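
The derivation can also be checked numerically. The sketch below multiplies the three local derivatives by hand and compares the product with what autograd stored in <tt>x.grad</tt>. The variable names `do_dz`, `dz_dy`, `dy_dx` and `manual` are introduced only for this illustration.

```python
import torch

x = torch.tensor([[1., 2, 3], [3, 2, 1]], requires_grad=True)
y = 3*x + 2
z = 2*y**2
out = z.mean()
out.backward()

# Local derivatives from the chain rule, evaluated without graph tracking
with torch.no_grad():
    do_dz = torch.full_like(z, 1 / z.numel())  # 1/6 for each element
    dz_dy = 4 * y                              # 4(3x_i + 2)
    dy_dx = torch.full_like(x, 3.0)            # 3
    manual = do_dz * dz_dy * dy_dx             # elementwise product

print(manual)
print(torch.allclose(manual, x.grad))  # True
```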

### Turn off tracking

There may be times when we don't want or need to track the computational history.

You can reset a tensor's `requires_grad` attribute in place with `.requires_grad_(False)`.
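
A minimal sketch of switching the flag off on a leaf tensor (the tensor `w` is a made-up example):

```python
import torch

w = torch.ones(2, requires_grad=True)
print(w.requires_grad)    # True

# In-place toggle; the flag can only be changed on leaf tensors
w.requires_grad_(False)
print(w.requires_grad)    # False

out = (w * 3).sum()
print(out.requires_grad)  # False: nothing in this computation is tracked
```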

When performing evaluations, it's often helpful to wrap a set of operations in `with torch.no_grad():`
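
For instance, in the sketch below the same expression is tracked outside the block but not inside it:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

with torch.no_grad():
    y_eval = x**2             # computed, but not recorded in the graph
print(y_eval.requires_grad)   # False

y_train = x**2                # same expression outside the block
print(y_train.requires_grad)  # True
```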

A less-used method is to call `.detach()` on a tensor. This returns a new tensor that shares the same data but is cut off from the computational history, so later operations on it are not tracked.

This can be handy when cloning a tensor.
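
A short sketch of the difference between detaching alone and detaching plus cloning (variable names are illustrative):

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x * 2

d = y.detach()           # shares storage with y, no graph history
c = y.detach().clone()   # independent copy, no graph history

print(d.requires_grad, c.requires_grad)   # False False
print(d.data_ptr() == y.data_ptr())       # True: same underlying memory
print(c.data_ptr() == y.data_ptr())       # False: separate memory
```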

In [12]:
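# one recurrent-style step: only the weights Wh and Wx require gradients
Wh = torch.randn(3, 3, requires_grad=True)
Wx = torch.randn(3, 2, requires_grad=True)
# inputs are created without requires_grad, so it defaults to False
h = torch.randn(1, 3)
x = torch.randn(1, 2)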

wh2h = torch.mm(Wh,h.t())
wx2x = torch.mm(Wx,x.t())
print(wh2h)
print(wx2x)

next_h = wh2h + wx2x + 5
next_h = next_h.tanh()
print(next_h)

loss = next_h.sum()
print(loss)

# backprop
loss.backward()

# Wx and Wh were created with requires_grad=True, so their gradients are stored
print(Wx.grad)
print(Wh.grad)
# x and h have requires_grad=False, so their .grad stays None
print(x.grad)
print(h.grad)

tensor([[-1.8642],
        [-1.1607],
        [ 0.1536]], grad_fn=<MmBackward0>)
tensor([[-1.1271],
        [ 1.1491],
        [-1.6593]], grad_fn=<MmBackward0>)
tensor([[0.9646],
        [0.9999],
        [0.9982]], grad_fn=<TanhBackward0>)
tensor(2.9627, grad_fn=<SumBackward0>)
tensor([[-0.1199,  0.0583],
        [-0.0003,  0.0002],
        [-0.0064,  0.0031]])
tensor([[ 0.0799,  0.0740, -0.0679],
        [ 0.0002,  0.0002, -0.0002],
        [ 0.0042,  0.0039, -0.0036]])
None
None
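
As a rough cross-check of the cell above, the weight gradients can be reproduced by hand. Since $\frac{d}{da}\tanh(a) = 1 - \tanh^2(a)$ and the pre-activation is linear in `Wh` and `Wx`, each gradient should be an outer product of that local term with the corresponding input. The seed, the name `local`, and the `atol` value are assumptions for this sketch. The loose tolerance is there because $1 - \tanh^2(a)$ is tiny when $\tanh$ is close to 1.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only to make the sketch repeatable
Wh = torch.randn(3, 3, requires_grad=True)
Wx = torch.randn(3, 2, requires_grad=True)
h = torch.randn(1, 3)
x = torch.randn(1, 2)

next_h = (torch.mm(Wh, h.t()) + torch.mm(Wx, x.t()) + 5).tanh()
next_h.sum().backward()

# local derivative of tanh at each output, shape (3, 1)
local = 1 - next_h.detach()**2

# (3, 1) @ (1, 3) -> (3, 3) and (3, 1) @ (1, 2) -> (3, 2)
print(torch.allclose(Wh.grad, local @ h, atol=1e-6))  # True
print(torch.allclose(Wx.grad, local @ x, atol=1e-6))  # True
```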
