In [1]:
import torch

x = torch.ones(5) # input tensor
y = torch.zeros(3) # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.rand(3, requires_grad=True)
z = torch.matmul(x, w)+b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z,y)

Binary Cross Entropy:<br/>
![](BCE.png)
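
As a quick sanity check, the loss can be recomputed by hand. This sketch assumes the figure shows the standard binary cross entropy definition, averaged over the elements (the default `reduction='mean'`); the example logits are made up for illustration.

```python
import torch

# Illustrative logits and targets (not the values from the cell above).
z = torch.tensor([0.5, -1.0, 2.0])
y = torch.zeros(3)

# Standard binary cross entropy applied to sigmoid(z), averaged over elements.
p = torch.sigmoid(z)
manual = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()

# Built-in version that takes the raw logits directly.
builtin = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

print(torch.allclose(manual, builtin, atol=1e-6))
```

The built-in function works on logits rather than probabilities, which is numerically safer for large-magnitude inputs than applying the sigmoid and the logarithm separately.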

This code defines the following computational graph:<br/>
![](comp-graph.png)

In this network, w and b are parameters that we need to optimize, so we need to be able to compute the gradients of the loss function with respect to them. To do that, we set the requires_grad property of those tensors.<br/>
You can set requires_grad when creating a tensor, or later by calling the x.requires_grad_(True) method.
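
A minimal sketch of enabling gradient tracking after a tensor has been created. The tensor shape and the toy computation are chosen only for illustration.

```python
import torch

# A tensor created without gradient tracking.
w = torch.randn(5, 3)
print(w.requires_grad)

# requires_grad_ changes the flag in place and returns the same tensor.
w.requires_grad_(True)
print(w.requires_grad)

# Operations performed after the flag is set are tracked.
out = (w * 2).sum()
out.backward()
print(w.grad.shape)
```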

A reference to the backward propagation function is stored in the grad_fn property of a tensor.

In [2]:
print('Gradient function for z =', z.grad_fn)
print('Gradient function for loss =', loss.grad_fn)

Gradient function for z = <AddBackward0 object at 0x7f74205bc190>
Gradient function for loss = <BinaryCrossEntropyWithLogitsBackward object at 0x7f742061c2b0>


Computing Gradients

In [3]:
loss.backward()
print(w.grad)
print(b.grad)

tensor([[0.0072, 0.1718, 0.2829],
        [0.0072, 0.1718, 0.2829],
        [0.0072, 0.1718, 0.2829],
        [0.0072, 0.1718, 0.2829],
        [0.0072, 0.1718, 0.2829]])
tensor([0.0072, 0.1718, 0.2829])
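
Every row of w.grad above equals b.grad. This follows from the chain rule: with a mean-reduced loss, the gradient with respect to z is (sigmoid(z) - y) / 3, the gradient with respect to b is the same vector, and the gradient with respect to w is the outer product of x with that vector. Because x is all ones, each row repeats it. The sketch below checks this against autograd; the seed is arbitrary, so the numbers will differ from the output above.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.rand(3, requires_grad=True)
z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()

# Analytic gradients derived by hand with the chain rule.
with torch.no_grad():
    dz = (torch.sigmoid(z) - y) / z.numel()  # d(loss)/dz for the mean reduction
    expected_b = dz                          # z = x @ w + b, so d(loss)/db = dz
    expected_w = torch.outer(x, dz)          # d(loss)/dw[i, j] = x[i] * dz[j]

print(torch.allclose(b.grad, expected_b, atol=1e-6))
print(torch.allclose(w.grad, expected_w, atol=1e-6))
```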


For performance reasons, we can call backward only once on a given graph: the intermediate buffers needed for the gradient computation are freed after the first call. If we need several backward calls on the same graph, we must pass retain_graph=True to the backward call.
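
A small sketch of calling backward more than once on the same graph. It also shows that gradients accumulate in .grad across calls, so they usually need to be reset between independent computations.

```python
import torch

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.rand(3, requires_grad=True)
z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

# First call keeps the graph alive for another backward pass.
loss.backward(retain_graph=True)
first = b.grad.clone()

# Second call reuses the graph; the new gradient is added to the old one.
loss.backward()
second = b.grad.clone()
print(torch.allclose(second, 2 * first))

# The graph's buffers were freed by the second call, so a third call fails.
try:
    loss.backward()
    raised = False
except RuntimeError:
    raised = True
print(raised)

# Reset accumulated gradients before starting an unrelated computation.
w.grad.zero_()
b.grad.zero_()
```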

Disabling Gradient Tracking:<br/>
By default, all tensors with requires_grad=True track their computational history and support gradient computation. However, in some cases we do not need this, for example when we have trained the model and only want to apply it to some input data, i.e. we only want to run forward computations through the network. We can stop tracking computations by wrapping the computation code in a torch.no_grad() block, or by calling detach() on the tensor:

In [4]:
z = torch.matmul(x, w)+b
print(z.requires_grad)

z_det = z.detach()
print(z_det.requires_grad)

with torch.no_grad():
    z = torch.matmul(x, w)+b
print(z.requires_grad)


True
False
False


There are two reasons you might want to disable gradient tracking:<br/>
1. To mark some parameters in your neural network as frozen parameters. This is a very common scenario when fine-tuning a pretrained network.
2. To speed up computations when you are only doing a forward pass, because computations on tensors that do not track gradients are more efficient.
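
The first case can be sketched as follows. The two-layer model is a hypothetical stand-in for a pretrained network; only the idea of switching off requires_grad for selected parameters comes from the text above.

```python
import torch
from torch import nn

# Hypothetical small model standing in for a pretrained network.
model = nn.Sequential(nn.Linear(5, 4), nn.ReLU(), nn.Linear(4, 3))

# Freeze the first layer: its parameters no longer take part in autograd.
for param in model[0].parameters():
    param.requires_grad = False

loss = model(torch.ones(1, 5)).sum()
loss.backward()

# Frozen parameters receive no gradient; the trainable layer still does.
print(model[0].weight.grad)
print(model[2].weight.grad.shape)
```

An optimizer would then typically be given only the parameters that still have requires_grad=True.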

More on Computational Graphs
Conceptually, autograd keeps a record of data (tensors) and all executed operations (along with the resulting new tensors) in a directed acyclic graph (DAG) consisting of Function objects. In this DAG, the leaves are the input tensors and the roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

In a forward pass, autograd does two things simultaneously:

1. run the requested operation to compute a resulting tensor
2. maintain the operation’s gradient function in the DAG.

The backward pass kicks off when .backward() is called on the DAG root. autograd then:

1. computes the gradients from each .grad_fn,
2. accumulates them in the respective tensor’s .grad attribute
3. using the chain rule, propagates all the way to the leaf tensors.
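
The graph recorded during the forward pass can be inspected by following grad_fn and its next_functions attribute from a root towards the leaves. The walk helper below is a hypothetical name introduced for this sketch, and the exact node names for intermediate operations depend on the PyTorch version.

```python
import torch

x = torch.ones(5)
w = torch.randn(5, 3, requires_grad=True)
b = torch.rand(3, requires_grad=True)
z = torch.matmul(x, w) + b

def walk(fn, depth=0):
    """Collect (depth, node name) pairs, starting from a grad_fn node."""
    if fn is None:  # inputs that do not require gradients have no node
        return []
    nodes = [(depth, type(fn).__name__)]
    for next_fn, _ in fn.next_functions:
        nodes.extend(walk(next_fn, depth + 1))
    return nodes

nodes = walk(z.grad_fn)
for depth, name in nodes:
    print("  " * depth + name)

# Leaf tensors are created by the user and have no grad_fn of their own.
print(w.is_leaf, z.is_leaf)
```

The AccumulateGrad nodes at the ends of the walk are where gradients are written into the .grad attribute of the leaf tensors w and b.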

DAGs are dynamic in PyTorch
An important thing to note is that the graph is recreated from scratch: after each .backward() call, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model; you can change the shape, size and operations at every iteration if needed.
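
A minimal sketch of a graph that changes between iterations. The forward function and its steps argument are hypothetical; the loop length decides how many multiplications are recorded, so each call builds a different graph.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

def forward(x, steps):
    # Ordinary Python control flow: the number of recorded ops depends on steps.
    out = x
    for _ in range(steps):
        out = out * w
    return out

grads = {}
for steps in (1, 3):
    w.grad = None                          # clear the gradient from the previous graph
    y = forward(torch.tensor(1.0), steps)  # y = w ** steps
    y.backward()
    grads[steps] = w.grad.item()           # d(w ** steps)/dw = steps * w ** (steps - 1)

print(grads)  # {1: 1.0, 3: 12.0}
```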