# Automatic Differentiation with torch.autograd
When training neural networks, the most frequently used algorithm is backpropagation. In this algorithm, parameters are adjusted according to the gradient of the loss function with respect to each parameter. To compute those gradients, PyTorch has a built-in differentiation engine called torch.autograd. It supports automatic computation of gradients for any computational graph. Consider the simplest one-layer neural network, with input x, parameters w and b, and some loss function. It can be defined in PyTorch in the following manner:

In [1]:
import torch

x = torch.ones(5)  # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)  # weights, tracked by autograd
b = torch.randn(3, requires_grad=True)  # bias, tracked by autograd
z = torch.matmul(x, w) + b  # logits
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

## Tensors, Functions and Computational Graph
You can set the value of requires_grad when creating a tensor, or later by using the x.requires_grad_(True) method. A function that we apply to tensors to construct the computational graph is in fact an object of class Function. This object knows how to compute the function in the forward direction, and also how to compute its derivative during the backpropagation step. A reference to the backward propagation function is stored in the grad_fn property of a tensor.

In [4]:
print('Gradient function for z =', z.grad_fn)
print('Gradient function for loss = ',loss.grad_fn)

Gradient function for z = <AddBackward0 object at 0x000001CA9D8CAEB0>
Gradient function for loss =  <BinaryCrossEntropyWithLogitsBackward object at 0x000001CA9D8CAE20>
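
The two ways of setting requires_grad, and the difference between leaf tensors and tensors produced by an operation, can be checked with a minimal sketch. The names a and c are hypothetical and used only for illustration.

```python
import torch

# A tensor created without requires_grad is not tracked by autograd.
a = torch.ones(3)
print(a.requires_grad)  # False

# requires_grad_() flips the flag in place and returns the same tensor.
a.requires_grad_(True)
print(a.requires_grad)  # True

# Tensors created directly by the user have no grad_fn;
# tensors produced by an operation record the Function that created them.
c = a * 2
print(a.grad_fn)              # None
print(c.grad_fn is not None)  # True
```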


## Note 
We can only obtain the grad properties for the leaf nodes of the computational graph that have the requires_grad property set to True. For all other nodes in the graph, gradients will not be available. For performance reasons, we can only perform gradient calculations using backward once on a given graph. If we need to do several backward calls on the same graph, we need to pass retain_graph=True to the backward call.
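
As a rough illustration of this note, the sketch below rebuilds the one-layer network from the first cell, calls backward on the loss, and inspects the gradients of the leaf tensors. It assumes the same shapes as that cell. The gradient values depend on the random initialization, so only shapes are printed.

```python
import torch

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

# Keep the graph alive so that a second backward pass is possible.
loss.backward(retain_graph=True)
print(w.grad.shape)  # torch.Size([5, 3])
print(b.grad.shape)  # torch.Size([3])

# z is an intermediate (non-leaf) node, so autograd does not populate its .grad.
print(z.is_leaf)  # False

# This second call works only because retain_graph=True was passed above.
loss.backward()
```

Without retain_graph=True in the first call, the second loss.backward() would raise a RuntimeError, because the buffers of the graph are freed after the first backward pass.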

## Disabling Gradient Tracking 
By default, all tensors with requires_grad=True track their computational history and support gradient computation. However, there are cases when we do not need this, for example when we have trained the model and just want to apply it to some input data, i.e., we only want to do forward computations through the network. We can stop tracking computations by surrounding our computation code with a torch.no_grad() block:

In [5]:
z = torch.matmul(x,w)+b
print(z.requires_grad)
with torch.no_grad():
    z = torch.matmul(x,w)+b
print(z.requires_grad)

True
False


Another way to achieve the same result is to use the detach() method on the tensor:

In [6]:
z = torch.matmul(x,w)+b
z_det = z.detach()
print(z_det.requires_grad)

False


There are two main reasons you might want to disable gradient tracking. The first is to mark some parameters in your neural network as frozen parameters, which is a very common scenario when fine-tuning a pretrained network. The second is to speed up computations when you are only doing a forward pass, because computations on tensors that do not track gradients are more efficient.
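
The frozen-parameter scenario can be sketched as follows. The two Linear layers are a hypothetical stand-in for a pretrained network with a new output layer; the names backbone and head are assumptions made for this example.

```python
import torch

# Hypothetical stand-in for a pretrained network: a backbone plus a new head.
backbone = torch.nn.Linear(5, 4)
head = torch.nn.Linear(4, 3)

# Freeze the backbone so autograd does not compute gradients for its parameters.
for param in backbone.parameters():
    param.requires_grad_(False)

out = head(backbone(torch.ones(1, 5)))
out.sum().backward()

print(backbone.weight.grad)          # None: frozen, no gradient computed
print(head.weight.grad is not None)  # True: still trainable
```

Only the parameters of head receive gradients, so an optimizer built over head.parameters() would leave the backbone untouched.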

## More on Computational Graphs
Conceptually, autograd keeps a record of data (tensors) and all executed operations in a directed acyclic graph (DAG) consisting of Function objects. In this DAG, leaves are the input tensors and roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule. In a forward pass, autograd does two things simultaneously: it runs the requested operation to compute a resulting tensor, and it maintains the operation's gradient function in the DAG. The backward pass kicks off when .backward() is called on the DAG root. autograd then computes the gradients from each .grad_fn, accumulates them in the respective tensor's .grad attribute, and, using the chain rule, propagates all the way to the leaf tensors.
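
A minimal sketch of this forward and backward flow on a three-node graph is shown below. The tensors x, y and z here are scalars chosen for illustration and are unrelated to the variables in the earlier cells.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)  # leaf
y = x ** 2                                 # intermediate node
z = 3 * y                                  # root

print(x.is_leaf, x.grad_fn)              # True None
print(y.is_leaf, z.grad_fn is not None)  # False True

z.backward()
# Chain rule: dz/dx = dz/dy * dy/dx = 3 * 2x = 12 at x = 2
print(x.grad)  # tensor(12.)
```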
## Note

DAGs are dynamic in PyTorch. An important thing to note is that the graph is recreated from scratch: after each .backward() call, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model; you can change the shape, size and operations at every iteration if needed.
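
To make the dynamic behaviour concrete, the sketch below uses an ordinary Python loop whose length changes between iterations, so each forward pass builds a graph of a different depth. The function forward and the parameter n_steps are hypothetical names introduced for this example.

```python
import torch

w = torch.tensor(1.5, requires_grad=True)

def forward(x, n_steps):
    # The number of multiplications is decided at run time,
    # so each call builds a graph with a different depth.
    out = x
    for _ in range(n_steps):
        out = out * w
    return out

for n_steps in (1, 3):
    w.grad = None  # clear the gradient left over from the previous iteration
    out = forward(torch.tensor(2.0), n_steps)
    out.backward()
    # out = 2 * w**n, so d(out)/dw = 2 * n * w**(n - 1)
    print(n_steps, w.grad)
```

With w = 1.5 the gradient is 2 for one step and 13.5 for three steps, which matches the formula in the comment.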
## Optional Reading: Tensor Gradients and Jacobian Products
In many cases, we have a scalar loss function, and we need to compute the gradient with respect to some parameters. However, there are cases when the output function is an arbitrary tensor. In this case, PyTorch allows you to compute a so-called Jacobian product rather than the actual gradient. This is done by calling backward with a tensor argument that has the same shape as the output.
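
The term Jacobian product can be made concrete with a short definition. The symbols $\vec{x}$, $\vec{y}$, $J$ and $v$ are notation assumed here for illustration. For a vector function $\vec{y} = f(\vec{x})$ with $\vec{x} = \langle x_1, \dots, x_n \rangle$ and $\vec{y} = \langle y_1, \dots, y_m \rangle$, the gradient of $\vec{y}$ with respect to $\vec{x}$ is the Jacobian matrix

$$
J = \begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{pmatrix}
$$

Instead of materializing $J$, a backward call with a vector $v = \langle v_1, \dots, v_m \rangle$ yields the product

$$
v^{T} \cdot J = \left( \sum_{i=1}^{m} v_i \frac{\partial y_i}{\partial x_1}, \; \dots, \; \sum_{i=1}^{m} v_i \frac{\partial y_i}{\partial x_n} \right)
$$

For a scalar output ($m = 1$), $J$ has a single row, and $v = 1$ recovers the ordinary gradient.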

In [7]:
inp = torch.eye(5, requires_grad=True)
out = (inp+1).pow(2)
out.backward(torch.ones_like(inp),retain_graph=True)
print('First call\n',inp.grad)
out.backward(torch.ones_like(inp),retain_graph=True)
print('\n Second call\n',inp.grad)
inp.grad.zero_()
out.backward(torch.ones_like(inp),retain_graph=True)
print('\n Call after zeroing gradients \n',inp.grad)

First call
 tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.],
        [2., 2., 2., 2., 4.]])

 Second call
 tensor([[8., 4., 4., 4., 4.],
        [4., 8., 4., 4., 4.],
        [4., 4., 8., 4., 4.],
        [4., 4., 4., 8., 4.],
        [4., 4., 4., 4., 8.]])

 Call after zeroing gradients 
 tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.],
        [2., 2., 2., 2., 4.]])


Notice that when we call backward for the second time with the same argument, the value of the gradient is different. This happens because PyTorch accumulates gradients during backward propagation, i.e., the value of the computed gradients is added to the grad property of all leaf nodes of the computational graph. If you want to compute the proper gradients, you need to zero out the grad property beforehand. In real-life training, an optimizer helps us do this.
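
A minimal sketch of how an optimizer handles this in a training loop is shown below. The quadratic loss and the learning rate are arbitrary choices made for illustration, not part of the example above.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

for step in range(2):
    optimizer.zero_grad()  # reset accumulated gradients before the new backward pass
    loss = (w ** 2).sum()
    loss.backward()
    print(step, w.grad)    # equals 2 * w for the current w, not a running sum
    optimizer.step()       # w <- w - lr * w.grad
```

Removing the optimizer.zero_grad() call would make the second printed gradient the sum of both backward passes.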
## Note
Previously we were calling the backward() function without parameters. This is essentially equivalent to calling backward(torch.tensor(1.0)), which is a useful way to compute the gradients in the case of a scalar-valued function, such as the loss during neural network training.
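
This equivalence can be checked directly on a small scalar-valued function. The tensor x and the names grad_default and grad_explicit are hypothetical and used only for this comparison.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Scalar output: backward() with no argument.
(x ** 2).sum().backward()
grad_default = x.grad.clone()

# Same computation, passing the gradient argument explicitly.
x.grad = None
(x ** 2).sum().backward(torch.tensor(1.0))
grad_explicit = x.grad.clone()

print(torch.equal(grad_default, grad_explicit))  # True
```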