02 PyTorch Autograd
=====================
### Date: Jan 9 2018
### Author: Farahana

In [1]:
import torch as tc
from torch.autograd import Variable

Let us start with a simple use of autograd.

In [5]:
x = Variable(tc.Tensor([1]), requires_grad=True)  # define x = 1
w = Variable(tc.Tensor([2]), requires_grad=True)  # define w = 2
b = Variable(tc.Tensor([3]), requires_grad=True)  # define b = 3

In [7]:
y = w * x + b 

In [8]:
y.backward()  # compute the gradients of y with respect to x, w and b

In [15]:
print(x.grad_fn)  # x was created by the user (a leaf), so it has no grad_fn

None


In [16]:
# x.grad holds dy/dx, filled in by y.backward()
# because x was created with requires_grad=True
print(x.grad) 

Variable containing:
 2
[torch.FloatTensor of size 1]
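The value 2 is dy/dx = w. By the same rule, dy/dw = x = 1 and dy/db = 1. The following is a minimal sketch of the same computation, assuming PyTorch 0.4 or newer, where tensors carry `requires_grad` directly and the `Variable` wrapper is no longer needed. It is not the notebook's original cell, only an illustration of the three gradients.

```python
import torch

# Assumption: PyTorch 0.4+ (tensors replace Variable).
x = torch.tensor([1.0], requires_grad=True)  # x = 1
w = torch.tensor([2.0], requires_grad=True)  # w = 2
b = torch.tensor([3.0], requires_grad=True)  # b = 3

y = w * x + b   # y = 5
y.backward()    # fills .grad on every leaf that requires a gradient

# dy/dx = w, dy/dw = x, dy/db = 1
print(x.grad, w.grad, b.grad)
```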



In [12]:
print(y.grad_fn)  # y is the result of an operation, so it has a grad_fn

<AddBackward1 object at 0x7fd599910240>


In [14]:
print(y.grad)  # y is an intermediate (non-leaf) result, so its .grad is not kept

None
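If the gradient of an intermediate result is needed, autograd can be asked to keep it. The sketch below assumes PyTorch 0.4 or newer and uses `retain_grad()`; the extra variable `z` is a hypothetical addition so that y is not the final node of the graph.

```python
import torch

x = torch.tensor([1.0], requires_grad=True)
w = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([3.0], requires_grad=True)

y = w * x + b     # intermediate (non-leaf) result
y.retain_grad()   # ask autograd to keep y.grad after backward()
z = 2 * y         # hypothetical extra step, so that dz/dy = 2
z.backward()

print(y.is_leaf)  # False: y was produced by an operation
print(y.grad)     # dz/dy
print(x.grad)     # dz/dx = dz/dy * dy/dx = 2 * w
```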


* Autograd is the automatic differentiation module in PyTorch. 
* It automates the backward computation in a neural network. 
* Imagine a computational graph with nodes and edges: 
the nodes are the tensors, while the edges are the functions that compute them.
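The graph described above can be inspected from Python. This is a rough sketch, assuming PyTorch 0.4 or newer: it follows `grad_fn` and its `next_functions` attribute, which link each recorded operation to the operations that produced its inputs. The helper `walk` is a hypothetical name, and the class names it prints differ between PyTorch versions.

```python
import torch

x = torch.tensor([1.0], requires_grad=True)
w = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([3.0], requires_grad=True)
y = w * x + b

def walk(fn, depth=0):
    """Print the recorded backward graph rooted at fn, one node per line."""
    if fn is None:  # inputs that need no gradient appear as None
        return
    print("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1)

# The root is the addition; below it sit the multiplication and the leaves.
walk(y.grad_fn)
```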

Let us recap the previous example from [**01 Introduction into PyTorch**](01-pytorch.ipynb).

* **N** is the batch size
* **D_in** is the input dimension, while **D_out** is the output dimension
* **H** is the hidden dimension

In [23]:
# Initialization of the example
N, D_in, H, D_out = 24, 1000, 100, 4
learning_rate = 1e-6

dtype = tc.cuda.FloatTensor  # run on the GPU
# dtype = tc.FloatTensor     # uncomment to run on the CPU instead

Define the input and the expected output; neither requires a gradient.

In [21]:
x = Variable(tc.randn(N, D_in).type(dtype), requires_grad=False) # input
y = Variable(tc.randn(N, D_out).type(dtype), requires_grad=False) # output

Initialize the weights randomly and flag them as requiring gradients.

In [22]:
w1 = Variable(tc.randn(D_in, H).type(dtype), requires_grad=True) # weights_1
w2 = Variable(tc.randn(H, D_out).type(dtype), requires_grad=True) # weights_2

In [25]:
for t in range(200):
    # Because the weights are flagged with requires_grad=True, 
    # the forward pass does not need to keep references for the backward pass; 
    # the computation of the gradients is automated
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    
    # Compute the loss (sum of squared errors) whose gradients we need
    loss = (y_pred - y).pow(2).sum()
    # The loss is itself a Variable because it is computed from y and y_pred, 
    # so .grad_fn and .backward() are available on it.
    print(t, loss.data[0])
    
    # The manual backward computation that loss.backward() replaces:
    # grad_y_pred = 2.0 * (y_pred - y)
    # grad_w2 = h_relu.t().mm(grad_y_pred)
    # grad_h_relu = grad_y_pred.mm(w2.t())
    # grad_h = grad_h_relu.clone()
    # grad_h[h < 0] = 0
    # grad_w1 = x.t().mm(grad_h)
    loss.backward()
    
    # Gradients are only available for the weights, because only w1 and w2 have requires_grad=True
    # Update the weights by subtracting the gradients: w = w - learning_rate * w_gradients
    w1.data -= learning_rate * w1.grad.data
    w2.data -= learning_rate * w2.grad.data
    
    # Zero the gradients after use, otherwise the next backward() accumulates into them
    w1.grad.data.zero_()
    w2.grad.data.zero_()

0 7644135.5
1 2262797.5
2 1502437.0
3 1022621.3125
4 708579.5625
5 498156.8125
6 354510.90625
7 254962.6875
8 185074.8125
9 135411.265625
10 99775.046875
11 73982.2421875
12 55163.1484375
13 41348.67578125
14 31142.341796875
15 23559.6015625
16 18010.763671875
17 13844.4775390625
18 10684.201171875
19 8275.9228515625
20 6435.82080078125
21 5022.36572265625
22 3930.627685546875
23 3084.525390625
24 2426.66162109375
25 1913.693603515625
26 1512.5338134765625
27 1197.9970703125
28 951.3430786132812
29 757.2006225585938
30 603.8499755859375
31 482.4577331542969
32 386.1450500488281
33 309.58251953125
34 248.59622192382812
35 199.9409942626953
36 161.04180908203125
37 129.8955841064453
38 104.91561126708984
39 84.84761047363281
40 68.7056884765625
41 55.702171325683594
42 45.210269927978516
43 36.7368049621582
44 29.884475708007812
45 24.333560943603516
46 19.833486557006836
47 16.1809024810791
48 13.214195251464844
49 10.798812866210938
50 8.834538459777832
51 7.231388092041016
52 5.925728
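For readers on a newer PyTorch, the same loop can be sketched without `Variable` and `.data`. This is a minimal sketch, assuming PyTorch 0.4 or newer and a CPU-only setup: the weight update runs inside `torch.no_grad()` so that it is not recorded in the graph, and `loss.item()` replaces `loss.data[0]`. The fixed seed and the `losses` list are additions for illustration only.

```python
import torch

torch.manual_seed(0)  # assumption: fixed seed, only for repeatability
N, D_in, H, D_out = 24, 1000, 100, 4
learning_rate = 1e-6

x = torch.randn(N, D_in)                        # input, no gradient needed
y = torch.randn(N, D_out)                       # expected output
w1 = torch.randn(D_in, H, requires_grad=True)   # weights_1
w2 = torch.randn(H, D_out, requires_grad=True)  # weights_2

losses = []
for t in range(200):
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()
    losses.append(loss.item())
    loss.backward()

    # Update in place without recording the update itself in the graph
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()

print(losses[0], losses[-1])
```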

Next, let us compare the expected output with the predicted output.

In [26]:
print(y,y_pred)

Variable containing:
-0.0588 -0.1549 -0.9363 -0.6117
-0.5614  0.2786 -0.6237  0.9257
 0.4124  1.0885 -0.3523 -0.1353
-0.6585  1.9010  0.4739  0.3442
 0.7310  0.3233 -0.5113 -0.4284
-0.4787  0.6141 -0.1065 -0.7604
 0.9709  0.5020  1.8626 -1.3757
-0.5717  0.4749  1.9955  0.1409
 0.1185 -2.0504 -1.4487  1.1535
 2.2414 -0.0043  0.6162 -0.4058
 2.0237 -0.5026 -1.1217  0.5577
 1.7662  0.1556  0.2713  0.9467
-0.6436  0.4181  0.8284 -0.9558
 0.5169 -0.7242  1.9608 -0.7905
 0.2504 -0.1324 -0.3770  1.2509
 0.4430 -1.8406 -0.4133 -0.8317
 2.4253  0.2787 -0.0510 -0.3843
-1.6114  1.1748  1.9463  1.1621
-0.3994 -0.5286  0.7286 -0.1246
-1.0058 -0.8582 -0.5995 -0.6680
 2.8242  0.3650  0.3401 -1.3912
-1.0736 -0.7527  0.4383  0.3862
 0.0242  0.2578  0.5763 -0.1229
 1.1378 -0.3710  0.4106 -0.5006
[torch.cuda.FloatTensor of size 24x4 (GPU 0)]
 Variable containing:
-0.0588 -0.1548 -0.9366 -0.6116
-0.5614  0.2785 -0.6237  0.9255
 0.4125  1.0884 -0.3521 -0.1355
-0.6585  1.9009  0.4739  0.3441
 0.7311  0.3234
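The hand-written backward pass that appears as comments in the training loop can be checked against what autograd produces. The sketch below assumes PyTorch 0.4 or newer and uses double precision (an assumption made so that rounding differences stay negligible). The intermediate names `h` and `h_relu` follow the commented formulas.

```python
import torch

torch.manual_seed(0)  # assumption: fixed seed, only for repeatability
N, D_in, H, D_out = 24, 1000, 100, 4
dt = torch.float64    # assumption: double precision for a tight comparison

x = torch.randn(N, D_in, dtype=dt)
y = torch.randn(N, D_out, dtype=dt)
w1 = torch.randn(D_in, H, dtype=dt, requires_grad=True)
w2 = torch.randn(H, D_out, dtype=dt, requires_grad=True)

# Forward pass, keeping the intermediates that the manual formulas need
h = x.mm(w1)
h_relu = h.clamp(min=0)
y_pred = h_relu.mm(w2)
loss = (y_pred - y).pow(2).sum()
loss.backward()  # autograd gradients land in w1.grad and w2.grad

# Manual backward pass, exactly as in the commented lines of the loop
with torch.no_grad():
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

same_w1 = torch.allclose(grad_w1, w1.grad)
same_w2 = torch.allclose(grad_w2, w2.grad)
print(same_w1, same_w2)
```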

Although autograd already makes it much easier to train a neural network, 
PyTorch has another module for neural networks that makes life even easier.
We will discuss it in a later part.
***

### Defining custom gradients with autograd.Function ###
We have to define new **forward** and **backward** functions in a new *subclass* of torch.autograd.Function.

In [33]:
class MyReLU(tc.autograd.Function):
    """
    ReLU implemented as a custom autograd Function
    with its own forward and backward passes.
    """
    def forward(self, input):
        """
        In the forward pass we receive a Tensor containing the input and return a
        Tensor containing the output. You can cache arbitrary Tensors for use in the
        backward pass using the save_for_backward method.
        """
        self.save_for_backward(input)
        return input.clamp(min=0)

    def backward(self, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = self.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input
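The class above uses the instance-method style of the PyTorch release current when this notebook was written. Newer releases expect `forward` and `backward` to be static methods that receive a context object, and the function is called through `.apply`. The following is a sketch under that assumption (PyTorch 0.4 or newer); the class name `MyReLUStatic` and the small test tensor `a` are hypothetical.

```python
import torch

class MyReLUStatic(torch.autograd.Function):
    """ReLU as a custom autograd Function, static-method style."""

    @staticmethod
    def forward(ctx, input):
        # Cache the input on the context for use in the backward pass
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        # Pass the incoming gradient through where input >= 0, zero elsewhere
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input

a = torch.tensor([-1.0, 0.5, 2.0], requires_grad=True)
out = MyReLUStatic.apply(a)  # call through .apply, not by instantiating
out.sum().backward()
print(a.grad)
```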

Then, we use the newly defined autograd function in the previous example.

In [35]:
# Define a new set of weights to test the new function
w1_new = Variable(tc.randn(D_in, H).type(dtype), requires_grad=True)
w2_new = Variable(tc.randn(H, D_out).type(dtype), requires_grad=True) 

In [36]:
for t in range(200):
    # create a fresh MyReLU instance for this forward pass
    relu = MyReLU()
    
    y_pred = relu(x.mm(w1_new)).mm(w2_new)
     
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.data[0])
    
    loss.backward()
    
    w1_new.data -= learning_rate * w1_new.grad.data
    w2_new.data -= learning_rate * w2_new.grad.data

    w1_new.grad.data.zero_()
    w2_new.grad.data.zero_()

0 5723820.0
1 2114383.5
2 1400284.875
3 958165.6875
4 671006.25
5 479670.5625
6 349039.03125
7 257918.890625
8 193252.640625
9 146505.40625
10 112228.109375
11 86800.484375
12 67709.09375
13 53213.54296875
14 42102.71875
15 33523.3203125
16 26840.30859375
17 21596.056640625
18 17454.263671875
19 14163.9375
20 11536.53515625
21 9427.9267578125
22 7728.888671875
23 6355.03369140625
24 5238.806640625
25 4328.54638671875
26 3584.089599609375
27 2973.500244140625
28 2471.4169921875
29 2057.49365234375
30 1715.487060546875
31 1432.366455078125
32 1197.55126953125
33 1002.6358032226562
34 840.2159423828125
35 704.8309326171875
36 591.8270874023438
37 497.4017639160156
38 418.3902893066406
39 352.1930847167969
40 296.6844787597656
41 250.0999298095703
42 210.95851135253906
43 178.05197143554688
44 150.35897827148438
45 127.04344940185547
46 107.39552307128906
47 90.8292007446289
48 76.85138702392578
49 65.05194854736328
50 55.085792541503906
51 46.662349700927734
52 39.54330062866211
53 33.519

In [37]:
print(y,y_pred)

Variable containing:
-0.0588 -0.1549 -0.9363 -0.6117
-0.5614  0.2786 -0.6237  0.9257
 0.4124  1.0885 -0.3523 -0.1353
-0.6585  1.9010  0.4739  0.3442
 0.7310  0.3233 -0.5113 -0.4284
-0.4787  0.6141 -0.1065 -0.7604
 0.9709  0.5020  1.8626 -1.3757
-0.5717  0.4749  1.9955  0.1409
 0.1185 -2.0504 -1.4487  1.1535
 2.2414 -0.0043  0.6162 -0.4058
 2.0237 -0.5026 -1.1217  0.5577
 1.7662  0.1556  0.2713  0.9467
-0.6436  0.4181  0.8284 -0.9558
 0.5169 -0.7242  1.9608 -0.7905
 0.2504 -0.1324 -0.3770  1.2509
 0.4430 -1.8406 -0.4133 -0.8317
 2.4253  0.2787 -0.0510 -0.3843
-1.6114  1.1748  1.9463  1.1621
-0.3994 -0.5286  0.7286 -0.1246
-1.0058 -0.8582 -0.5995 -0.6680
 2.8242  0.3650  0.3401 -1.3912
-1.0736 -0.7527  0.4383  0.3862
 0.0242  0.2578  0.5763 -0.1229
 1.1378 -0.3710  0.4106 -0.5006
[torch.cuda.FloatTensor of size 24x4 (GPU 0)]
 Variable containing:
-0.0589 -0.1552 -0.9364 -0.6117
-0.5614  0.2786 -0.6237  0.9257
 0.4124  1.0884 -0.3524 -0.1352
-0.6584  1.9008  0.4737  0.3441
 0.7311  0.3233
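Besides comparing the loss curves, the backward pass of a custom function can be checked numerically. The sketch below assumes PyTorch 0.4 or newer and uses `torch.autograd.gradcheck`, which compares the analytical gradient with a finite-difference estimate. The static-method class `MyReLUChecked` is a hypothetical rewrite of `MyReLU`, and the inputs are chosen away from zero, where ReLU is not differentiable.

```python
import torch

class MyReLUChecked(torch.autograd.Function):
    """Static-method variant of the custom ReLU, used only for the check."""

    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input

# gradcheck expects double-precision inputs that require gradients
a = torch.tensor([-1.5, -0.3, 0.4, 2.0], dtype=torch.float64, requires_grad=True)
ok = torch.autograd.gradcheck(MyReLUChecked.apply, (a,), eps=1e-6, atol=1e-4)
print(ok)
```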