# PyTorch Gradients
Notebook inspired by the Udemy course "PyTorch for Deep Learning with Python Bootcamp".

This section covers the PyTorch <a href='https://pytorch.org/docs/stable/autograd.html'><strong><tt>autograd</tt></strong></a> implementation of automatic differentiation, which computes the gradients used by gradient descent.

## Autograd - Automatic Differentiation
When training neural networks, the most frequently used algorithm is
**back propagation**. In this algorithm, parameters (model weights) are
adjusted according to the **gradient** of the loss function with respect
to the given parameter.

To compute those gradients, PyTorch has a built-in differentiation engine
called ``torch.autograd``. It supports automatic computation of gradients for any
computational graph.
    
Let's see this in practice.

## Back-propagation on one step
We'll start by applying a single polynomial function $y = f(x)$ to tensor $x$. Then we'll backprop and print the gradient $\frac {dy} {dx}$.

$\begin{split}Function:\quad y &= 2x^4 + x^3 + 3x^2 + 5x + 1 \\
Derivative:\quad y' &= 8x^3 + 3x^2 + 6x + 5\end{split}$

#### Step 1. Perform standard imports

In [2]:
import torch 

#### Step 2. Create a tensor with <tt>requires_grad</tt> set to True
This sets up computational tracking on the tensor.

In [3]:
x = torch.tensor(2.0, requires_grad=True)

#### Step 3. Define a function

In [4]:
y = 2*x**4 + x**3 + 3*x**2 + 5*x + 1

print(y)

tensor(63., grad_fn=<AddBackward0>)


Since $y$ was created as a result of an operation, it has an associated gradient function accessible as <tt>y.grad_fn</tt><br>
The calculation of $y$ is done as:<br>

$\quad y=2(2)^4+(2)^3+3(2)^2+5(2)+1 = 32+8+12+10+1 = 63$

This is the value of $y$ when $x=2$.

#### Step 4. Backprop

In [5]:
y.backward()

#### Step 5. Display the resulting gradient

In [6]:
print(x.grad)

tensor(93.)


Note that <tt>x.grad</tt> is an attribute of tensor $x$, so we don't use parentheses. The computation is the result of<br>

$\quad y'=8(2)^3+3(2)^2+6(2)+5 = 64+12+12+5 = 93$

This is the slope of the polynomial at the point $(2,63)$.
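
As a quick sanity check, the comparison between autograd and the hand-derived derivative can be repeated at several points at once. The sketch below is illustrative only: the helper names `f` and `f_prime` and the choice of evaluation points are assumptions, not part of the original notebook.

```python
import torch

def f(x):
    # Same polynomial as above: y = 2x^4 + x^3 + 3x^2 + 5x + 1
    return 2*x**4 + x**3 + 3*x**2 + 5*x + 1

def f_prime(x):
    # Hand-derived derivative: y' = 8x^3 + 3x^2 + 6x + 5
    return 8*x**3 + 3*x**2 + 6*x + 5

# Evaluation points chosen arbitrarily for illustration
points = torch.tensor([-1.0, 0.0, 0.5, 2.0], requires_grad=True)

# backward() needs a scalar output, so we sum the values of f.
# Each point only influences its own term, so points.grad holds f'(x) per point.
f(points).sum().backward()

print(points.grad)                 # gradients computed by autograd
print(f_prime(points.detach()))    # gradients computed by hand
```

Both printed tensors should agree, and the last entry corresponds to the value 93 obtained above for $x=2$.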

## Back-propagation on multiple steps

### EXERCISE:
Now let's do something more complex, involving layers $y$ and $z$ between $x$ and our output layer $out$.

Write code that performs the following steps:
* Create a tensor <tt>x</tt> of size (2,3)
* Create the first layer with $y = 3x+2$
* Create the second layer with $z = 2y^2$
* Set the output equal to the matrix mean <tt>z.mean()</tt>
* Now perform back-propagation to find the gradient of $out$ w.r.t. $x$, stored in <tt>x.grad</tt>
* Compute the derivative manually and verify that you obtain the correct result


In [14]:
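# One possible solution
x = torch.tensor([[1., 2, 3], [3, 2, 1]], requires_grad=True)
# Alternative: x = torch.Tensor([[1., 2, 3], [3, 2, 1]]).requires_grad_()
print(x)
print(x.grad)            # None: nothing has been back-propagated yet

y = 3 * x + 2            # first layer
y.retain_grad()          # y is not a leaf tensor, so ask autograd to keep y.grad
z = 2 * y**2             # second layer
out = z.mean()           # scalar output
out.backward()

print(x.grad)            # d(out)/dx computed by autograd
print(1/6 * 4 * y * 3)   # manual check: d(out)/dx = (1/6) * 4y * 3
print(y.grad)            # d(out)/dy = (1/6) * 4y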

tensor([[1., 2., 3.],
        [3., 2., 1.]], requires_grad=True)
None
tensor([[10., 16., 22.],
        [22., 16., 10.]])
tensor([[10., 16., 22.],
        [22., 16., 10.]], grad_fn=<MulBackward0>)
tensor([[3.3333, 5.3333, 7.3333],
        [7.3333, 5.3333, 3.3333]])
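
The manual derivative requested in the exercise can be written out with the chain rule. This is a worked sketch that assumes the six entries of $x$ are indexed by $i$:

$\begin{split}out &= \frac{1}{6}\sum_{i} z_i = \frac{1}{6}\sum_{i} 2y_i^2, \qquad y_i = 3x_i + 2 \\
\frac{\partial out}{\partial y_i} &= \frac{1}{6}\cdot 4y_i = \frac{2}{3}y_i \\
\frac{\partial out}{\partial x_i} &= \frac{\partial out}{\partial y_i}\cdot\frac{\partial y_i}{\partial x_i} = \frac{2}{3}y_i\cdot 3 = 2y_i = 6x_i + 4\end{split}$

For $x_i = 1$ this gives $y_i = 5$, $\frac{\partial out}{\partial y_i} = 3.3333$ and $\frac{\partial out}{\partial x_i} = 10$, which matches the first entries of <tt>y.grad</tt> and <tt>x.grad</tt> printed above.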


## Example: one-layer neural network
Consider the simplest one-layer neural network, with input ``x``,
parameters ``w`` and ``b``, and some loss function. 

It can be defined in PyTorch in the following manner:

In [20]:
x = torch.ones(5)  # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w)+b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

In this network, ``w`` and ``b`` are **parameters**, which we need to
optimize. Thus, we need to be able to compute the gradients of loss
function with respect to those variables. In order to do that, we set
the ``requires_grad`` property of those tensors.



<div class="alert alert-info"><h4>Note</h4><p>You can set the value of ``requires_grad`` when creating a
          tensor, or later by using the ``x.requires_grad_(True)`` method.</p></div>
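
The two ways of enabling gradient tracking mentioned in the note can be compared side by side. This is a minimal sketch; the tensor names `t_created` and `t_later` are hypothetical and chosen only for illustration.

```python
import torch

# Option 1: request gradient tracking when the tensor is created
t_created = torch.randn(3, requires_grad=True)

# Option 2: create the tensor first, then switch tracking on in place
t_later = torch.randn(3)
print(t_later.requires_grad)    # tracking is off by default
t_later.requires_grad_(True)
print(t_later.requires_grad)    # tracking is now on

# Both tensors are leaves of the graph, so both receive a gradient
(t_created * t_later).sum().backward()
print(t_created.grad)
print(t_later.grad)
```

Since the output is the sum of element-wise products, the gradient with respect to each tensor is simply the other tensor.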



A function that we apply to tensors to construct a computational graph is
in fact an object of class ``Function``. This object knows how to
compute the function in the *forward* direction, and also how to compute
its derivative during the *backward propagation* step. A reference to
the backward propagation function is stored in the ``grad_fn`` property of a
tensor. You can find more information on ``Function`` [in the
documentation](https://pytorch.org/docs/stable/autograd.html#function).




In [21]:
print(f"Gradient function for z = {z.grad_fn}")
print(f"Gradient function for loss = {loss.grad_fn}")

Gradient function for z = <AddBackward0 object at 0x7f4e9bfa04d0>
Gradient function for loss = <BinaryCrossEntropyWithLogitsBackward object at 0x7f4e9bfa0610>


### Computing Gradients

To optimize the parameters of the neural network, we need to
compute the derivatives of our loss function with respect to those parameters;
namely, we need $\frac{\partial loss}{\partial w}$ and
$\frac{\partial loss}{\partial b}$ under some fixed values of
``x`` and ``y``. To compute those derivatives, we call
``loss.backward()``, and then retrieve the values from ``w.grad`` and
``b.grad``:




In [22]:
loss.backward()
print(w.grad)
print(b.grad)

tensor([[0.1075, 0.2615, 0.2888],
        [0.1075, 0.2615, 0.2888],
        [0.1075, 0.2615, 0.2888],
        [0.1075, 0.2615, 0.2888],
        [0.1075, 0.2615, 0.2888]])
tensor([0.1075, 0.2615, 0.2888])
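
These gradients can be cross-checked by hand. The sketch below rebuilds the same network and assumes the default mean reduction of `binary_cross_entropy_with_logits`, for which $\frac{\partial loss}{\partial z} = \frac{\sigma(z) - y}{3}$; the seed and the names `dz`, `manual_b_grad` and `manual_w_grad` are illustrative choices, not part of the original notebook.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

x = torch.ones(5)     # input tensor
y = torch.zeros(3)    # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()

# The loss is the mean over the 3 outputs, so
# d(loss)/dz = (sigmoid(z) - y) / 3
dz = (torch.sigmoid(z.detach()) - y) / y.numel()
manual_b_grad = dz                    # z_j = ... + b_j, so d(z_j)/d(b_j) = 1
manual_w_grad = torch.outer(x, dz)    # d(z_j)/d(w_ij) = x_i

print(torch.allclose(b.grad, manual_b_grad, atol=1e-6))
print(torch.allclose(w.grad, manual_w_grad, atol=1e-6))
```

Because every entry of ``x`` is 1, each row of ``w.grad`` equals ``b.grad``, which is the pattern visible in the output above.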


<div class="alert alert-info"><h4>Note</h4><ul>
  <li>By default, the ``grad`` property is only populated for the leaf
    nodes of the computational graph that have the ``requires_grad`` property
    set to ``True``. Gradients of all other nodes are not kept unless
    ``retain_grad()`` is called on them before the backward pass.</li>
  <li>We can only perform gradient calculations using
    ``backward`` once on a given graph, for performance reasons. If we need
    to do several ``backward`` calls on the same graph, we need to pass
    ``retain_graph=True`` to the ``backward`` call.</li>
</ul></div>
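
Both points of the note can be illustrated on a tiny graph. This is a sketch under simple assumptions (a scalar input and the function $out = 3x^2$, both chosen for illustration); it also shows that repeated ``backward`` calls add to the stored gradient rather than overwrite it.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)   # leaf tensor
y = x ** 2                                   # intermediate (non-leaf) tensor
y.retain_grad()                              # keep y.grad after the backward pass
out = 3 * y

# Keep the graph alive so that backward can be called a second time
out.backward(retain_graph=True)
first = x.grad.clone()
print(first, y.grad)    # d(out)/dx = 6x = 12, d(out)/dy = 3

# The second call accumulates into x.grad instead of replacing it
out.backward()
second = x.grad.clone()
print(second)           # 12 + 12 = 24

# Reset the gradient before starting an independent computation
x.grad.zero_()
```

The explicit reset at the end is why training loops clear gradients between iterations.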




### Disabling Gradient Tracking

By default, all tensors with ``requires_grad=True`` are tracking their
computational history and support gradient computation. However, there
are some cases when we do not need to do that, for example, when we have
trained the model and just want to apply it to some input data, i.e. we
only want to do *forward* computations through the network. We can stop
tracking computations by wrapping our computation code in a
``torch.no_grad()`` block:




In [23]:
z = torch.matmul(x, w)+b
print(z.requires_grad)

with torch.no_grad():
    z = torch.matmul(x, w)+b
print(z.requires_grad)

True
False


Another way to achieve the same result is to use the ``detach()`` method
on the tensor:




In [24]:
z = torch.matmul(x, w)+b
z_det = z.detach()
print(z_det.requires_grad)

False


There are two main reasons you might want to disable gradient tracking:
  - To mark some parameters in your neural network as **frozen parameters**. This is
    a very common scenario for
    [finetuning a pretrained network](https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html).
  - To **speed up computations** when you are only doing a forward pass, because computations on tensors that do
    not track gradients are more efficient.
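
Freezing a parameter can be sketched with plain tensors. The two weight matrices below are hypothetical stand-ins for a pretrained layer (`w_frozen`) and a layer that is still being trained (`w_train`); the shapes are arbitrary.

```python
import torch

x = torch.ones(1, 4)                                # dummy input
w_frozen = torch.randn(4, 8, requires_grad=True)    # stands in for pretrained weights
w_train = torch.randn(8, 2, requires_grad=True)     # layer we still want to train

# Freeze the first layer: autograd no longer tracks this tensor
w_frozen.requires_grad_(False)

out = (torch.relu(x @ w_frozen) @ w_train).sum()
out.backward()

print(w_frozen.grad)        # None: no gradient is computed for the frozen tensor
print(w_train.grad.shape)   # the trainable tensor still receives a gradient
```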



### What's next: the torch.nn module
Now we know how to make PyTorch compute gradients. So, in principle, if we want to build a neural network in PyTorch, we could specify all our parameters (weight matrices, bias vectors) using `Tensors` (with `requires_grad=True`), ask PyTorch to calculate the gradients and then adjust the parameters ourselves. But things can quickly get cumbersome if we have a lot of parameters. In PyTorch, there is a package called `torch.nn` that makes building neural networks more convenient.
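
To make the "in principle" approach concrete, here is a minimal sketch of gradient descent written with bare tensors. The toy data ($y = 2x + 1$), the learning rate and the number of steps are assumptions chosen for illustration only.

```python
import torch

# Toy data, assumed for illustration: targets follow y = 2x + 1
x = torch.linspace(-1, 1, 20)
y = 2 * x + 1

# Parameters to learn, tracked by autograd
w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.1  # learning rate (arbitrary choice)

for step in range(200):
    loss = ((w * x + b - y) ** 2).mean()   # mean squared error
    loss.backward()                        # compute d(loss)/dw and d(loss)/db

    # Update the parameters without recording the update in the graph
    with torch.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad

    # Clear the accumulated gradients before the next iteration
    w.grad.zero_()
    b.grad.zero_()

print(w.item(), b.item())   # should approach 2 and 1
```

Every parameter needs its own update and reset lines here, which is the bookkeeping that becomes cumbersome as the number of parameters grows.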

In the next section we will explore how to use this module to build and train our neural network.