At its core, PyTorch provides two main features:

1. An n-dimensional Tensor, similar to a NumPy array but able to run on GPUs

2. Automatic differentiation for building and training neural networks

https://github.com/jcjohnson/pytorch-examples

Ref: Linlin's blog, "Understanding backward() in PyTorch"

https://linlinzhao.github.io/tech/2017/10/21/Understanding-backward()-in-PyTorch.html

# A Tensor does not track gradients
# A Variable supports automatic gradient calculation

Thankfully, we can use automatic differentiation to automate the computation of backward passes in neural networks. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network will define a computational graph; nodes in the graph will be Tensors, and edges will be functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.

This sounds complicated, but it's pretty simple to use in practice. We wrap our PyTorch Tensors in Variable objects; a Variable represents a node in a computational graph. If x is a Variable then x.data is a Tensor, and x.grad is another Variable holding the gradient of some scalar value with respect to x.

PyTorch Variables have the same API as PyTorch Tensors: (almost) any operation that you can perform on a Tensor also works on Variables; the difference is that using Variables defines a computational graph, allowing you to automatically compute gradients.

In [1]:
import torch

In [2]:
from torch.autograd import Variable

# Tensor <-> Variable

In [3]:
x = [ [1, 2, 3], [4, 5, 6]]

In [4]:
y = torch.FloatTensor(x)

In [5]:
z = Variable(y)

In [6]:
z

Variable containing:
 1  2  3
 4  5  6
[torch.FloatTensor of size 2x3]

In [7]:
z = Variable(y, requires_grad = True)

In [8]:
z

Variable containing:
 1  2  3
 4  5  6
[torch.FloatTensor of size 2x3]
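
The cells above use the Variable wrapper from older PyTorch releases. From PyTorch 0.4 onward, Variable and Tensor were merged, so a tensor can carry requires_grad directly. A minimal sketch of the same conversion, assuming a recent PyTorch install (the names `data`, `t` and `doubled` are only illustrative):

```python
import torch

# Assumes PyTorch >= 0.4, where a Tensor can track gradients without a Variable wrapper.
data = [[1, 2, 3], [4, 5, 6]]
t = torch.tensor(data, dtype=torch.float32, requires_grad=True)

doubled = 2 * t  # any operation on t is recorded in the graph

print(t.requires_grad)  # True: t is tracked by autograd
print(t.grad_fn)        # None: t is a leaf created by the user
print(doubled.grad_fn)  # a backward node for the multiplication
```

On such versions, Variable(y) still runs but simply returns a Tensor.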

# torch.randn

In [9]:
torch.randn(1, 1) 


 0.4562
[torch.FloatTensor of size 1x1]

randn returns a tensor filled with random numbers drawn from a normal distribution with zero mean and unit variance.

In [10]:
torch.randn(1)


-0.1740
[torch.FloatTensor of size 1]

In [11]:
torch.randn(4)


-1.6094
 0.5641
 0.6932
 1.6073
[torch.FloatTensor of size 4]

In [12]:
torch.randn(2, 3)


-1.8842 -0.2613  0.6636
-0.5081  2.6716 -0.3373
[torch.FloatTensor of size 2x3]

In [13]:
Variable(torch.randn(1))

Variable containing:
-0.5223
[torch.FloatTensor of size 1]

In [15]:
Variable(torch.randn(2, 3))

Variable containing:
 0.2711  2.0850  1.0181
 0.0304 -0.4553 -1.0596
[torch.FloatTensor of size 2x3]
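
To see the "zero mean, unit variance" claim about randn in practice, one option is to draw a large sample and look at its statistics. A small sketch, assuming a recent PyTorch install; the seed and the sample size are arbitrary choices:

```python
import torch

torch.manual_seed(0)  # fixed seed so the run is repeatable

samples = torch.randn(100000)  # 100,000 draws from the standard normal
sample_mean = samples.mean().item()
sample_std = samples.std().item()

print(sample_mean, sample_std)  # should be close to 0 and 1 respectively
```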

# Variable gradients

In [16]:
x = Variable(torch.randn(1, 1), requires_grad=True)  # x is a leaf created by the user, so its grad_fn is None

In [17]:
x

Variable containing:
-0.6444
[torch.FloatTensor of size 1x1]

In [18]:
y = 2 * x  #define an operation on x

In [19]:
y

Variable containing:
-1.2889
[torch.FloatTensor of size 1x1]

In [20]:
z = y ** 3  #define one more operation to check the chain rule

In [21]:
z

Variable containing:
-2.1411
[torch.FloatTensor of size 1x1]

In [22]:
print('z gradient', z.grad)
print('y gradient', y.grad)
print('x gradient', x.grad) # note that x.grad is also a Variable

z gradient None
y gradient None
x gradient None


In [23]:
z.backward()

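After the backward pass, x.grad holds dz/dx = 24 * x**2, because z = (2x)**3 = 8 * x**3. z.grad and y.grad stay None because gradients are only kept for leaf Variables.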

In [24]:
print('z gradient', z.grad)
print('y gradient', y.grad)
print('x gradient', x.grad) # note that x.grad is also a Variable

z gradient None
y gradient None
x gradient Variable containing:
 9.9674
[torch.FloatTensor of size 1x1]
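
The value autograd reports can be cross-checked without PyTorch. The sketch below uses plain Python to compare the chain-rule result, dz/dx = dz/dy * dy/dx = 3*y**2 * 2 = 24*x**2, with a central finite difference. The helper names are hypothetical, and x0 is the rounded value printed above, so the result only agrees with autograd's 9.9674 to about two decimal places:

```python
def z_of_x(x):
    """z = (2x)^3, the same two-step function used in the cells above."""
    y = 2 * x
    return y ** 3


def central_difference(f, x, h=1e-5):
    """Numerically approximate df/dx at x."""
    return (f(x + h) - f(x - h)) / (2 * h)


x0 = -0.6444  # the (rounded) value of x printed above
analytic = 24 * x0 ** 2  # chain rule: 3 * y**2 * 2 with y = 2 * x
numeric = central_difference(z_of_x, x0)

print(analytic, numeric)  # both should be close to the value reported by autograd
```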



# Use grad_variables to scale the gradient (e.g., by a learning rate)

In [25]:
x = Variable(torch.randn(1, 1), requires_grad=True)  # x is a leaf created by the user, so its grad_fn is None
y = 2 * x
z = y ** 3

In [26]:
z.backward(torch.FloatTensor([1]), retain_graph=True) 

grad_variables should be a list of torch tensors (here, a single tensor passed as the first argument to backward()). By default, backward() is applied to a scalar-valued function, so grad_variables defaults to torch.FloatTensor([1]).

In [27]:
print('z gradient', z.grad)
print('y gradient', y.grad)
print('x gradient', x.grad)

z gradient None
y gradient None
x gradient Variable containing:
 110.4875
[torch.FloatTensor of size 1x1]



In [28]:
x.grad.data.zero_()


 0
[torch.FloatTensor of size 1x1]

In [29]:
z.backward(torch.FloatTensor([0.1]), retain_graph=True)  # change grad_variables from its default of 1 to 0.1
print('z gradient', z.grad)
print('y gradient', y.grad)
print('x gradient', x.grad)

z gradient None
y gradient None
x gradient Variable containing:
 11.0488
[torch.FloatTensor of size 1x1]
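
The two runs above differ by exactly a factor of 10 (110.4875 vs 11.0488), because the tensor passed to backward() multiplies the gradient. A sketch that reproduces this with a fixed input, assuming PyTorch 0.4 or later, where the argument of Tensor.backward is named gradient; the value x = 1 and the variable names are chosen only to make the arithmetic easy to check:

```python
import torch

# Assumes PyTorch >= 0.4; x is fixed at 1.0 so the numbers are easy to check.
x = torch.tensor([[1.0]], requires_grad=True)
z = (2 * x) ** 3  # dz/dx = 24 * x**2 = 24 at x = 1

z.backward(gradient=torch.tensor([[1.0]]), retain_graph=True)
grad_weight_one = x.grad.clone()  # 24

x.grad.zero_()  # clear the accumulated gradient
z.backward(gradient=torch.tensor([[0.1]]), retain_graph=True)
grad_weight_tenth = x.grad.clone()  # 0.1 * 24 = 2.4

print(grad_weight_one, grad_weight_tenth)
```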



# When x is a matrix, z is also a matrix

In [30]:
x = Variable(torch.randn(2, 2), requires_grad=True)  # x is a leaf created by the user, so its grad_fn is None
print('x', x)

y = 2 * x
z = y ** 3

x Variable containing:
-0.1982  1.4902
-1.0550  0.4618
[torch.FloatTensor of size 2x2]



In [31]:
z.shape

torch.Size([2, 2])

In [32]:
z.backward(torch.FloatTensor([1, 1]), retain_graph=True)

In [33]:
x.grad

Variable containing:
  0.9431  53.2989
 26.7135   5.1188
[torch.FloatTensor of size 2x2]

To verify this answer, compute 24 * x**2 element-wise by hand or in a spreadsheet such as Excel; for example, 24 * (1.4902)**2 ≈ 53.30.

In [34]:
x.grad.data.zero_()


 0  0
 0  0
[torch.FloatTensor of size 2x2]

In [35]:
z.backward(torch.FloatTensor([1, 0]), retain_graph=True)

In [36]:
x.grad

Variable containing:
  0.9431   0.0000
 26.7135   0.0000
[torch.FloatTensor of size 2x2]
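
The cells above pass a two-element tensor for a 2x2 output, and the printed results suggest that the PyTorch release used there applied it across the rows. As far as we know, recent releases instead expect the gradient argument to have the same shape as z. A sketch of the equivalent call under that assumption (PyTorch 0.4 or later), with fixed inputs so the result can be checked by hand; `weights` and `expected` are illustrative names:

```python
import torch

# Assumes PyTorch >= 0.4; fixed inputs so the result can be checked by hand.
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
z = (2 * x) ** 3  # element-wise, so dz_ij/dx_ij = 24 * x_ij**2

# Weight the first column by 1 and the second column by 0.
weights = torch.tensor([[1.0, 0.0], [1.0, 0.0]])
z.backward(gradient=weights)

expected = weights * 24 * x.detach() ** 2
print(x.grad)  # first column is 24 * x**2, second column is zeroed out
```

Because every element of z depends only on the matching element of x, the weights simply scale each entry of 24 * x**2.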

# Scalar output with a matrix input: a simplified version of what happens in neural networks

In [37]:
x = Variable(torch.randn(2, 2), requires_grad=True)  # x is a leaf created by the user, so its grad_fn is None
print('x', x)

y = 2 * x
z = y ** 3
out = z.mean()

x Variable containing:
-1.0168 -0.3331
 0.9356  0.1357
[torch.FloatTensor of size 2x2]



In [38]:
out

Variable containing:
-0.5337
[torch.FloatTensor of size 1]

In [39]:
out.backward(torch.FloatTensor([1]), retain_graph=True)

In [40]:
x.grad

Variable containing:
 6.2033  0.6656
 5.2515  0.1105
[torch.FloatTensor of size 2x2]

In [41]:
x.grad.data.zero_()


 0  0
 0  0
[torch.FloatTensor of size 2x2]

In [42]:
out.backward(torch.FloatTensor([0.1]), retain_graph=True)

In [43]:
x.grad

Variable containing:
 0.6203  0.0666
 0.5252  0.0110
[torch.FloatTensor of size 2x2]
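
The gradients above can be checked by hand. A short derivation, using the fact that x has four elements and out is the mean of z:

```latex
\begin{aligned}
\text{out} &= \frac{1}{4}\sum_{i,j} z_{ij}
            = \frac{1}{4}\sum_{i,j} (2x_{ij})^3
            = 2\sum_{i,j} x_{ij}^3, \\
\frac{\partial\,\text{out}}{\partial x_{ij}} &= 6\,x_{ij}^2 .
\end{aligned}
```

For example, the first element of x is -1.0168, and 6 * 1.0168**2 ≈ 6.2033, which matches the first entry of x.grad. Passing 0.1 instead of 1 to backward() scales every entry by 0.1.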

# What is retain_graph doing?

When training a model, the graph is rebuilt in every iteration, so it does not matter that each backward pass consumes it. By default (retain_graph false), backward() frees the graph once it has run; to call backward() again on the same graph, set retain_graph to true.

In [44]:
x = Variable(torch.randn(2, 2), requires_grad=True)  # x is a leaf created by the user, so its grad_fn is None
print('x', x)

y = 2 * x
z = y ** 3
out = z.mean()

x Variable containing:
-0.2237  1.4244
 1.4300  0.4250
[torch.FloatTensor of size 2x2]



In [45]:
out

Variable containing:
 11.7597
[torch.FloatTensor of size 1]

In [46]:
out.backward(torch.FloatTensor([1]))  

Without retain_graph set to true, this backpropagation consumes the graph, so calling backward() on it again raises an error.

In [47]:
x.grad

Variable containing:
  0.3004  12.1738
 12.2694   1.0836
[torch.FloatTensor of size 2x2]

In [48]:
x.grad.data.zero_()


 0  0
 0  0
[torch.FloatTensor of size 2x2]

In [50]:
#out.backward(torch.FloatTensor([0.1]), retain_graph=True)
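
The commented-out call above would fail because the graph has already been consumed. A sketch that triggers and catches the error, assuming PyTorch 0.4 or later; the flag name `second_call_failed` is only illustrative, and the exact error message may differ between versions:

```python
import torch

# Assumes PyTorch >= 0.4.
x = torch.randn(2, 2, requires_grad=True)
out = ((2 * x) ** 3).mean()

out.backward()  # first call succeeds and frees the graph's saved buffers

second_call_failed = False
try:
    out.backward()  # second call on the same graph, without retain_graph
except RuntimeError as err:
    second_call_failed = True
    print("second backward() failed:", err)
```

Passing retain_graph=True to the first call keeps the buffers, so the second call then succeeds.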

# Wrap up

1. The backward() function makes differentiation very simple, and it is flexible enough to handle some uncommon differentiation needs.

2. For non-scalar Variables, we need to specify grad_variables.

3. If you need to call backward() twice on a graph or subgraph, set retain_graph to true; otherwise the first backward pass consumes the graph.

4. Remember that the gradient of a Variable accumulates across backward() calls; zero it if you do not want accumulation.
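
The accumulation behaviour in point 4 is easy to miss, so here is a small sketch of it, assuming PyTorch 0.4 or later. The input x = 1 is chosen so that dz/dx = 24 * x**2 = 24, and the variable names are illustrative:

```python
import torch

# Assumes PyTorch >= 0.4; x = 1 so that dz/dx = 24 * x**2 = 24.
x = torch.tensor(1.0, requires_grad=True)
z = (2 * x) ** 3

z.backward(retain_graph=True)
after_first = x.grad.item()  # gradient from the first pass

z.backward(retain_graph=True)
after_second = x.grad.item()  # second gradient was added to the first

x.grad.zero_()  # reset before the next backward pass
z.backward()
after_reset = x.grad.item()  # back to a single pass's gradient

print(after_first, after_second, after_reset)
```

In a training loop, this is why the gradients are usually zeroed once per iteration before calling backward().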