# PyTorch - AutoGrad

In [2]:
import torch

def _print(val):
    print(val,'\n')

In [7]:
x = torch.ones(2,2, requires_grad=True)
y = x + 2
_print(x.grad_fn) #first Tensor in graph has no grad_fn
_print(y.grad_fn)

None 

<AddBackward0 object at 0x0000023BFACF1EB8> 



In [8]:
z = y*y*3
out = z.mean()
print(z,out)

tensor([[27., 27.],
        [27., 27.]], grad_fn=<MulBackward0>) tensor(27., grad_fn=<MeanBackward0>)


To compute derivatives, call .backward() on a Tensor. If the Tensor is a scalar (i.e. it holds a single element), backward() needs no arguments. If it has more elements, you must pass a gradient argument: a tensor of matching shape.
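As a minimal sketch of the non-scalar case (the tensor names here are illustrative, not from the cells above), passing a tensor of ones as `gradient` gives the same result as calling `z.sum().backward()`:

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
z = 3 * (x + 2) ** 2          # non-scalar output, shape (2, 2)

# z is not a scalar, so backward() needs a `gradient` tensor of the same shape.
# A tensor of ones weights every element of z equally.
z.backward(gradient=torch.ones_like(z))

print(x.grad)                 # dz/dx = 6 * (x + 2) = 18 for every element
```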

In [9]:
_print(y)
out.backward()
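_print(x.grad)   # dout/dx = (1/4) * 6y = (1/4) * 6 * 3 = 4.5
_print(y.grad)   # None: y is not a leaf tensor, so its gradient is not retained
_print(z.grad)   # None: z is not a leaf tensor (dz/dy = 6y is only used internally)
_print(out.grad) # None: out is not a leaf tensor (dout/dz = 1/z.numel() = 1/4)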

tensor([[3., 3.],
        [3., 3.]], grad_fn=<AddBackward0>) 

tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]]) 

None 

None 

None 
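The three `None` values appear because autograd only keeps `.grad` on leaf tensors by default. The sketch below rebuilds the same graph and calls `retain_grad()` on the intermediate tensors so their gradients can be inspected. It is meant as an illustration, not as part of the original notebook run.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
y = x + 2
z = y * y * 3
out = z.mean()

# Ask autograd to keep gradients on these non-leaf tensors.
y.retain_grad()
z.retain_grad()
out.backward()

print(x.is_leaf, y.is_leaf)   # True False
print(z.grad)                 # dout/dz = 1/4 for every element
print(y.grad)                 # dout/dy = 6y/4 = 4.5 for every element
```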



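The derivative of a vector-valued function z = f(y), with y of length n and z of length m, is the __Jacobian matrix__:

$$
J = \begin{pmatrix}
\frac{\partial z_1}{\partial y_1} & \cdots & \frac{\partial z_1}{\partial y_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial z_m}{\partial y_1} & \cdots & \frac{\partial z_m}{\partial y_n}
\end{pmatrix}
$$

Autograd is an engine for computing vector-Jacobian products. At each step of the backward pass, the Jacobian of the current Function is multiplied by the gradient vector passed back from the previous step.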
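To make the vector-Jacobian product concrete, the following sketch compares `backward(gradient=v)` against an explicitly computed Jacobian. The function `f` and the vector `v` are made up for illustration, and `torch.autograd.functional.jacobian` is assumed to be available (recent PyTorch versions).

```python
import torch
from torch.autograd.functional import jacobian

def f(y):
    # Element-wise function, so its Jacobian is diagonal: diag(6 * y)
    return 3 * y ** 2

y = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([1.0, 0.5, 0.25])   # the "gradient" vector fed into backward

z = f(y)
z.backward(gradient=v)               # autograd computes v^T J without forming J

J = jacobian(f, y.detach())          # explicit 3x3 Jacobian, for comparison only
print(y.grad)                        # tensor([6.0000, 6.0000, 4.5000])
print(v @ J)                         # same values
```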
 

## Autograd Mechanics

[Autograd mechanics (PyTorch docs)](https://pytorch.org/docs/stable/notes/autograd.html)

Understanding autograd leads to more efficient code. Tensors have a __.requires_grad__ attribute that allows excluding subgraphs from gradient computation. If any input to a Function requires gradient, its output will also require gradient. The output does not require gradient only if none of the inputs do.

You can freeze some parameters to make them untrainable, which makes training cheaper in both memory and computation. For example, you can freeze the encoder and only require gradients for the decoder.
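A minimal sketch of freezing part of a model, assuming two hypothetical `nn.Linear` layers standing in for a pretrained part and a trainable part (the names `encoder` and `decoder` are illustrative):

```python
import torch
import torch.nn as nn

encoder = nn.Linear(8, 4)   # hypothetical pretrained part, to be frozen
decoder = nn.Linear(4, 2)   # hypothetical part that is still trained

# Freeze the encoder: autograd will not compute gradients for its parameters.
for p in encoder.parameters():
    p.requires_grad_(False)

x = torch.randn(5, 8)
loss = decoder(torch.relu(encoder(x))).sum()
loss.backward()

print(encoder.weight.grad)        # None, the frozen subgraph was skipped
print(decoder.weight.grad.shape)  # torch.Size([2, 4])

# Only hand the trainable parameters to the optimizer.
trainable = [p for p in decoder.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)
```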

Autograd builds a directed acyclic graph whose leaves are the input tensors and whose roots are the output tensors. The graph is made of Function objects (which can be .apply()ed to the inputs). During the forward pass, autograd runs the requested computation and at the same time builds up a graph representing the function that computes the gradient.

The graph is rebuilt from scratch at every iteration, which allows arbitrary Python control flow (if/else, loops). The shape and size of the graph can change at every iteration; whatever is executed is what gets differentiated.
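The sketch below illustrates this with a hypothetical `forward` function whose loop length and branches depend on runtime values. Each call records a different graph, and the gradient reflects only the operations that actually ran.

```python
import torch

def forward(x, n_steps):
    # Hypothetical model: the ops executed depend on the data and on n_steps.
    h = x
    for _ in range(n_steps):
        if h.sum() > 10:
            h = h * 0.5
        else:
            h = h * 3
    return h.sum()

x = torch.ones(2, requires_grad=True)
grads = {}
for n_steps in (1, 3):
    x.grad = None                      # clear the gradient from the previous run
    forward(x, n_steps).backward()
    grads[n_steps] = x.grad.clone()
    print(n_steps, x.grad)

# n_steps=1: one "* 3" step                 -> gradient 3
# n_steps=3: "* 3", "* 3", then "* 0.5"     -> gradient 4.5
```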

### In-Place Operations
In-place operations are discouraged. They rarely reduce memory usage by much, because autograd's buffer freeing and reuse already make memory management very efficient.

- In-place ops can overwrite values required for gradient calculation
- Every in-place op requires the implementation to rewrite the computational graph, whereas out-of-place ops simply allocate new objects that keep references to the old graph

*"while in-place operations, require changing the creator of all inputs to the Function representing this operation. This can be tricky, especially if there are many Tensors that reference the same storage (e.g. created by indexing or transposing), and in-place functions will actually raise an error if the storage of modified inputs is referenced by any other Tensor."*
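A small sketch of the first failure mode, overwriting a value that backward still needs. `exp` saves its output for the backward pass, so modifying that output in place makes autograd raise a `RuntimeError` when `backward()` is called. The tensor names are illustrative.

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x.exp()        # exp saves its output y for use in the backward pass
y.add_(1)          # in-place edit changes the saved value

try:
    y.sum().backward()
    failed = False
except RuntimeError as err:
    failed = True
    print("autograd refused:", str(err)[:70])

# The out-of-place version leaves the saved tensor untouched and works.
x.grad = None
y = x.exp()
y2 = y + 1
y2.sum().backward()
print(x.grad)      # d/dx exp(x) at x = 1, i.e. e for every element
```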



## Variables (Deprecated)

__Variable__ wraps a Tensor. It supports all Tensor methods, as well as the __.backward()__ method to perform backpropagation. Start from some Variable that has __requires_grad__ set and that feeds into the computation of the loss; loss.backward() then computes the gradient of the loss w.r.t. the trainable parameters. Since PyTorch 0.4, Variable and Tensor are merged, so a plain tensor with requires_grad=True behaves the same way.

In [1]:
import torch
from torch.autograd import Variable

X = Variable(torch.ones(2,2),)
print(X)
print(X.requires_grad) #variables do not default .requires_grad=True
X.requires_grad = True
print(X.requires_grad)

tensor([[1., 1.],
        [1., 1.]])
False
True


To train a part of a pretrained model, set __.requires_grad = True__ at the entrance of the subgraph to be trained. 

Add a Function to create another Variable.

In [25]:
y = X + 2  # dy/dX = 1
print(y)

tensor([[3., 3.],
        [3., 3.]], grad_fn=<AddBackward0>)


In [26]:
z = 2*(y**2)  # dz/dy = 4*y
print(z)
out = z.mean()  # dout/dz = 1/n = 1/4
print(out)

tensor([[18., 18.],
        [18., 18.]], grad_fn=<MulBackward0>)
tensor(18., grad_fn=<MeanBackward0>)


### Compute Gradient
Gradient computation starts from the Variable on which .backward() is called (in deep learning this is usually the cost function).

In [27]:
# dout/dx_i = (1/4) * 4*y_i = y_i = x_i + 2 = 3

In [28]:
out.backward()
print(X.grad)

tensor([[3., 3.],
        [3., 3.]])


In [34]:
X = Variable(torch.ones(2,2),requires_grad=True)
y = X + 2
z = 2*(y**3)
out = z.mean()
out.backward()
#dout/dxi = 1/4(dz/dx) = 1/4(6*y**2)(dy/dx) = 1/4(6*(X+2)**2)
# = 1.5(9) = 13.5
print(X.grad)

tensor([[13.5000, 13.5000],
        [13.5000, 13.5000]])


In [39]:
X = Variable(torch.ones(3,3),requires_grad=True)
y = 4*X + 2
z = 8*(y**2)
out = z.mean()
out.backward()
#dout/dxi = 1/9(dz/dx) = 1/9(16y)(dy/dx) = 1/9(16*(4*X+2))(4)
# = 42.667
print(X.grad)

tensor([[42.6667, 42.6667, 42.6667],
        [42.6667, 42.6667, 42.6667],
        [42.6667, 42.6667, 42.6667]])


## Dynamic Computation Graph

In [76]:
dtype = torch.FloatTensor
N, Din, H, Dout = 64, 1000, 100, 10

X = Variable(torch.randn(N, Din).type(dtype), requires_grad=False)
y = Variable(torch.randn(N, Dout).type(dtype), requires_grad=False)

w1 = Variable(torch.randn(Din, H).type(dtype), requires_grad=True)
w2 = Variable(torch.randn(H, Dout).type(dtype), requires_grad=True)

learning_rate = 0.00001
for i in range(2):
    y_pred = X.mm(w1).clamp(min=0).mm(w2)
    print(y_pred.shape)
    
    loss = (y_pred - y).pow(2).sum()
    
    if i%100==0:
        print(loss.data)
    
    loss.backward()

    
    w1.data -= learning_rate * w1.grad.data
    w2.data -= learning_rate * w2.grad.data
    
    #reset gradients!!
    w1.grad.data.zero_()
    w2.grad.data.zero_()
    
    

torch.Size([64, 10])
tensor(29824080.)
torch.Size([64, 10])
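For comparison, here is a sketch of the same two-layer network written without `Variable` or `.data`, using plain tensors and `torch.no_grad()` for the weight update. The sizes, learning rate, and iteration count are taken from the cell above; the fixed seed and the `losses` list are additions for illustration.

```python
import torch

torch.manual_seed(0)  # assumption: fixed seed, only to make reruns repeatable
N, Din, H, Dout = 64, 1000, 100, 10

X = torch.randn(N, Din)                         # inputs need no gradient
y = torch.randn(N, Dout)
w1 = torch.randn(Din, H, requires_grad=True)    # trainable weights
w2 = torch.randn(H, Dout, requires_grad=True)

learning_rate = 1e-5
losses = []
for i in range(2):
    # Forward pass: linear -> ReLU (clamp) -> linear
    y_pred = X.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()
    losses.append(loss.item())

    loss.backward()

    # Update the weights without recording the update in the graph.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

    # Gradients accumulate across backward() calls, so reset them.
    w1.grad.zero_()
    w2.grad.zero_()

print(losses)
```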


### Extending Function

__Function__ *Records operation history and defines formulas for differentiating ops.*

*Every operation performed on Tensors creates a new function object, that performs the computation, and records that it happened. The history is retained in the form of a DAG of functions, with edges denoting data dependencies (input <- output). Then, when backward is called, the graph is processed in the topological ordering, by calling backward() methods of each Function object, and passing returned gradients on to next Functions.*

*Normally, the only way users interact with functions is by creating subclasses and defining new operations. This is a recommended way of extending torch.autograd.*

*Each function object is meant to be used only once (in the forward pass).*

In [None]:
# Save whichever tensors the backward pass needs with ctx.save_for_backward
from torch.autograd import Function

class relu(torch.autograd.Function):
    
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)
    
    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input<0] = 0  # gradient is zero wherever the input was negative
        return grad_input
    
    
class Exp(Function):

    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result
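A custom Function is invoked through its `.apply` method rather than by instantiating the class. The sketch below repeats the `Exp` definition so it runs on its own, then checks the hand-written backward against numerical differentiation with `torch.autograd.gradcheck` (which expects double-precision inputs).

```python
import torch
from torch.autograd import Function, gradcheck

class Exp(Function):
    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        ctx.save_for_backward(result)    # d/dx exp(x) = exp(x), so save the output
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result

x = torch.randn(4, dtype=torch.double, requires_grad=True)

out = Exp.apply(x)                       # call via .apply, not Exp()(x)
out.sum().backward()
print(torch.allclose(x.grad, x.detach().exp()))

# Compare the analytical backward against finite differences.
ok = gradcheck(Exp.apply, (x,), eps=1e-6, atol=1e-4)
print(ok)
```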

## Access Data
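Access the underlying data of a tensor with __tensor.data__. In current PyTorch, __tensor.detach()__ is the recommended alternative, because in-place changes made through .data are not tracked by autograd.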

In [66]:
z.data

tensor([[288., 288., 288.],
        [288., 288., 288.],
        [288., 288., 288.]])
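A short sketch of `detach()`, using a smaller 2x2 version of the tensor above for illustration. The detached tensor shares storage with the original but is cut off from the graph.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
z = 8 * (4 * x + 2) ** 2                 # every element is 8 * 36 = 288

d = z.detach()                           # same values, no autograd history
print(d)
print(d.requires_grad)                   # False
print(d.grad_fn)                         # None
print(d.data_ptr() == z.data_ptr())      # True: no copy was made
```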

## Locally Disabling Gradient computation

In [74]:
print(z.requires_grad)
with torch.no_grad():
    h = z*2
    print(h.requires_grad)

True
False
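`torch.no_grad()` can also be used as a decorator, and `torch.enable_grad()` switches tracking back on inside a disabled region. A minimal sketch, with a hypothetical `predict` function and weight tensor `w`:

```python
import torch

w = torch.ones(3, requires_grad=True)

@torch.no_grad()
def predict(x):
    # Nothing computed in here is recorded in the graph.
    return (w * x).sum()

p = predict(torch.arange(3.0))
print(p.requires_grad)          # False

with torch.no_grad():
    with torch.enable_grad():   # re-enable tracking locally
        q = (w * 2).sum()
print(q.requires_grad)          # True
```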


## Profiler
Autograd includes a profiler that lets you inspect the cost of different operators inside your model, both on the CPU and GPU. There are two modes implemented at the moment: CPU-only using profile, and nvprof-based (registers both CPU and GPU activity) using emit_nvtx.

In [24]:
with torch.autograd.profiler.profile() as prof:
    X = Variable(torch.ones(3,3),requires_grad=True)
    y = torch.mul(X, 4) + 2
    z = 8*(y**2)
    out = z.mean()
print(prof.key_averages().table(sort_by='self_cpu_time_total'))

--------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name      Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
--------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
mul       46.43%           84.253us         46.43%           84.253us         42.127us         NaN              0.000us          0.000us          2                
pow       22.32%           40.506us         22.32%           40.506us         40.506us         NaN              0.000us          0.000us          1                
add       16.07%           29.165us         16.07%           29.165us         29.165us         NaN              0.000us          0.000us          1                
mean      15.18