# Warm-up: numpy

Before introducing PyTorch, we will first implement the network using numpy.

Numpy provides an n-dimensional array object, and many functions for manipulating these arrays. Numpy is a generic framework for scientific computing; it does not know anything about computation graphs, or deep learning, or gradients. However, we can easily use numpy to fit a two-layer network to random data by manually implementing the forward and backward passes through the network using numpy operations:

In [2]:
# -*- coding: utf-8 -*-
import numpy as np

# Manual backpropagation!!!

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 33604035.24248965
1 34429935.33959008
2 40323255.426083535
3 43828622.5379537
4 37651793.99561229
5 23818217.722445913
6 11419783.48685687
7 4977190.318096895
8 2438917.617055584
9 1485323.6403014986
10 1076694.827247693
11 857251.0286308117
12 712880.7570681137
13 604933.5932219281
14 519207.9972245651
15 448968.83282428596
16 390525.1846320142
17 341376.76094681583
18 299756.1123010445
19 264318.59620949783
20 233901.57025635347
21 207685.35969553352
22 184972.3479305404
23 165195.76598169864
24 147925.12320808112
25 132788.02413390524
26 119465.96934776581
27 107706.173893275
28 97297.91187987139
29 88053.77324523684
30 79822.27835007265
31 72475.73682922109
32 65901.71706986947
33 60020.049096947354
34 54736.86106719043
35 49981.88496717531
36 45692.89850624885
37 41815.51563226028
38 38308.02971222328
39 35128.28577094067
40 32238.904752444007
41 29615.46747264543
42 27227.95237896195
43 25054.857146111608
44 23072.689125661233
45 21262.836594495446
46 19608.27963535999
47 18094

479 6.608357418764179e-07
480 6.289183212556317e-07
481 5.985511937887261e-07
482 5.696622861604171e-07
483 5.421707811431315e-07
484 5.1600444231775e-07
485 4.911099914530159e-07
486 4.6741785551805545e-07
487 4.4487398338571226e-07
488 4.234211003059386e-07
489 4.0300965824325106e-07
490 3.83582207666172e-07
491 3.650946604829654e-07
492 3.4750234005763734e-07
493 3.307600787884032e-07
494 3.148258421220463e-07
495 2.996664306507415e-07
496 2.8523553581235116e-07
497 2.7150425395149565e-07
498 2.584382058737915e-07
499 2.460011107398181e-07


![title](img/my_backprop2.jpg)
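One way to gain confidence in a hand-written backward pass like the one above is to compare it against finite differences. The following is a minimal sketch under assumptions: the helper names (`loss_fn`, `manual_grads`, `numeric_grad`), the small array shapes, and the step size `eps` are hypothetical choices for illustration and are not part of the original notebook.

```python
import numpy as np

def loss_fn(x, y, w1, w2):
    # Same forward pass and loss as the training loop above
    y_pred = np.maximum(x.dot(w1), 0).dot(w2)
    return np.square(y_pred - y).sum()

def manual_grads(x, y, w1, w2):
    # Same manual backward pass as the training loop above
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    return grad_w1, grad_w2

def numeric_grad(f, w, eps=1e-6):
    # Central differences: perturb one entry of w at a time (w is restored afterwards)
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        loss_plus = f()
        w[idx] = old - eps
        loss_minus = f()
        w[idx] = old
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

# Small hypothetical problem so that the entry-by-entry check stays cheap
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5))
y = rng.standard_normal((4, 2))
w1 = rng.standard_normal((5, 3))
w2 = rng.standard_normal((3, 2))

grad_w1, grad_w2 = manual_grads(x, y, w1, w2)
num_w1 = numeric_grad(lambda: loss_fn(x, y, w1, w2), w1)
num_w2 = numeric_grad(lambda: loss_fn(x, y, w1, w2), w2)

max_err_w1 = np.abs(grad_w1 - num_w1).max()
max_err_w2 = np.abs(grad_w2 - num_w2).max()
print(max_err_w1, max_err_w2)
```

If the manual gradients are right, both printed errors should be tiny compared with the gradient values themselves. The check can be unreliable when a hidden activation sits almost exactly at zero, because ReLU is not differentiable there.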

## PyTorch: Tensors

Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or greater, so unfortunately numpy won’t be enough for modern deep learning.

Here we introduce the most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors. Like numpy arrays, PyTorch Tensors do not know anything about deep learning or computational graphs or gradients; they are a generic tool for scientific computing.
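To make the similarity to numpy arrays concrete, here is a small sketch of converting between the two. It assumes both `numpy` and `torch` are installed; the variable names are only illustrative.

```python
import numpy as np
import torch

a = np.ones((2, 3), dtype=np.float32)

# torch.from_numpy creates a Tensor that shares memory with the numpy array
t = torch.from_numpy(a)
t[0, 0] = 100.0          # this write is visible through `a` as well

# Tensor operations mirror numpy ones; .numpy() converts a CPU Tensor back
back = (t * 2).numpy()

print(a[0, 0], back.shape)
```

Because `torch.from_numpy` shares storage rather than copying, in-place changes to the Tensor also change the original array.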

However, unlike numpy arrays, PyTorch Tensors can utilize GPUs to accelerate their numeric computations. To run a PyTorch Tensor on a GPU, you simply need to specify the correct device when creating it, or move it there afterwards with `.to(device)`.
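As a rough sketch of device placement, the snippet below picks a GPU when one is available and falls back to the CPU otherwise. The fallback logic and the variable names are assumptions for illustration; the notebook itself hard-codes the device.

```python
import torch

# Use the first CUDA device if present, otherwise stay on the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Create a Tensor directly on the chosen device...
a = torch.randn(3, 4, device=device)
# ...or create it on the CPU and move it afterwards
b = torch.randn(4, 2).to(device)

# Operands must live on the same device; the result lives there too
c = a.mm(b)
print(c.device, c.shape)
```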

Here we use PyTorch Tensors to fit a two-layer network to random data. Like the numpy example above, we need to manually implement the forward and backward passes through the network:

In [3]:
# -*- coding: utf-8 -*-
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1) # mm - matrix multiply
    h_relu = h.clamp(min=0) # implementation of the ReLU - if a value is less than 0, then bring it to 0
    y_pred = h_relu.mm(w2) # matrix multiply

    # Compute and print loss
    # Use torch.Tensor.item() to get a Python number from a tensor containing a single value:
    loss = (y_pred - y).pow(2).sum().item()
    print(t, loss)

    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 36033240.0
1 31648028.0
2 28145876.0
3 22592880.0
4 15793286.0
5 9816277.0
6 5801947.5
7 3492640.75
8 2248613.0
9 1569968.25
10 1176699.5
11 928932.6875
12 758982.5
13 633928.3125
14 537037.5625
15 459626.40625
16 396316.125
17 343775.59375
18 299834.875
19 262647.78125
20 230957.171875
21 203765.234375
22 180326.28125
23 160031.234375
24 142365.96875
25 126968.0390625
26 113495.015625
27 101661.5234375
28 91242.171875
29 82052.21875
30 73919.8125
31 66710.1015625
32 60301.7734375
33 54610.234375
34 49527.36328125
35 44981.359375
36 40907.08203125
37 37248.23828125
38 33959.82421875
39 30999.015625
40 28327.65234375
41 25915.265625
42 23733.099609375
43 21755.373046875
44 19959.095703125
45 18326.982421875
46 16841.97265625
47 15491.4736328125
48 14260.171875
49 13135.7734375
50 12109.0654296875
51 11169.5986328125
52 10309.5537109375
53 9522.072265625
54 8799.09375
55 8135.7529296875
56 7526.65478515625
57 6966.869140625
58 6452.42138671875
59 5978.87060546875
60 5542.40966796875
61

# Autograd

## PyTorch: Tensors and autograd

In the above examples, we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but can quickly get very hairy for large complex networks.

Thankfully, we can use automatic differentiation to automate the computation of backward passes in neural networks. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network will define a computational graph; nodes in the graph will be Tensors, and edges will be functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.

This sounds complicated, but it’s pretty simple to use in practice. Each Tensor represents a node in a computational graph. If `x` is a Tensor that has `x.requires_grad=True`, then `x.grad` is another Tensor holding the gradient of some scalar value with respect to `x`.
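A tiny sketch of this behaviour, using a hypothetical three-element Tensor and the scalar `(x ** 2).sum()`:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Build a scalar from x; autograd records the operations as they run
s = (x ** 2).sum()

# Backpropagate from the scalar; this fills in x.grad
s.backward()

# d/dx of sum(x^2) is 2 * x
print(x.grad)
```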

Here we use PyTorch Tensors and autograd to implement our two-layer network; now we no longer need to manually implement the backward pass through the network:

In [4]:
# -*- coding: utf-8 -*-
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
# Leaving requires_grad at its default (False) indicates that we do not need to
# compute gradients with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors; these
    # are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss using operations on Tensors.
    # Now loss is a 0-dimensional Tensor holding a single value.
    # loss.item() gets the scalar value held in the loss.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()

    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd.
    # An alternative way is to operate on weight.data and weight.grad.data.
    # Recall that tensor.data gives a tensor that shares the storage with
    # tensor, but doesn't track history.
    # You can also use torch.optim.SGD to achieve this.
    #
    # torch.no_grad() is a context manager that disables gradient calculation.
    # Disabling gradient calculation is useful for inference, when you are sure
    # that you will not call Tensor.backward(). It reduces memory consumption
    # for computations that would otherwise have requires_grad=True. In this
    # mode, the result of every computation will have requires_grad=False,
    # even when the inputs have requires_grad=True.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

0 30349034.0
1 26774036.0
2 27398126.0
3 28026024.0
4 25481104.0
5 19725102.0
6 12758246.0
7 7343120.5
8 4022804.25
9 2291578.75
10 1419148.5
11 972351.0
12 725022.0625
13 574163.6875
14 472353.6875
15 397733.15625
16 339850.125
17 293279.375
18 254879.859375
19 222701.09375
20 195484.21875
21 172279.765625
22 152397.515625
23 135267.5
24 120429.6484375
25 107531.7109375
26 96274.0
27 86413.5078125
28 77775.90625
29 70162.171875
30 63447.7265625
31 57491.10546875
32 52193.8515625
33 47471.82421875
34 43250.85546875
35 39470.91015625
36 36078.51953125
37 33027.328125
38 30277.20703125
39 27793.11328125
40 25546.02734375
41 23509.6875
42 21660.9921875
43 19980.173828125
44 18449.609375
45 17053.720703125
46 15778.775390625
47 14612.8876953125
48 13545.2763671875
49 12566.322265625
50 11668.169921875
51 10842.896484375
52 10083.4521484375
53 9383.517578125
54 8738.1328125
55 8142.572265625
56 7592.365234375
57 7083.64697265625
58 6613.09423828125
59 6177.21142578125
60 5773.236328125
61 5

420 0.00030157703440636396
421 0.0002941189450211823
422 0.00028706880402751267
423 0.0002801348455250263
424 0.0002744178636930883
425 0.00026669446378946304
426 0.00026092323241755366
427 0.0002542954753153026
428 0.00024859519908204675
429 0.0002426495630061254
430 0.00023736586445011199
431 0.00023108987079467624
432 0.00022653836640529335
433 0.00022170001466292888
434 0.0002160237927455455
435 0.00021063399617560208
436 0.00020606697944458574
437 0.00020238700381014496
438 0.00019835242710541934
439 0.0001933693274622783
440 0.0001897537731565535
441 0.0001861617638496682
442 0.00018207648827228695
443 0.00017712330736685544
444 0.0001740188163239509
445 0.00017043345724232495
446 0.00016761620645411313
447 0.0001637890818528831
448 0.0001606283476576209
449 0.00015777946100570261
450 0.0001544471742818132
451 0.00015155512664932758
452 0.00014866348647046834
453 0.0001455468445783481
454 0.00014294199354480952
455 0.00014004428521730006
456 0.00013721574214287102
457 0.000134443
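The reason the loop above zeroes the gradients after each update is that `backward()` accumulates into `.grad` rather than overwriting it. A small sketch with a hypothetical two-element weight Tensor:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d/dw of sum(3 * w) is 3 for every element
(w * 3).sum().backward()
first = w.grad.clone()

# Second backward pass without zeroing: the new gradient is added on top
(w * 3).sum().backward()
second = w.grad.clone()

# Zeroing in place resets the accumulator for the next iteration
w.grad.zero_()
third = w.grad.clone()

print(first, second, third)
```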

## PyTorch: Defining new autograd functions

Under the hood, each primitive autograd operator is really two functions that operate on Tensors. The forward function computes output Tensors from input Tensors. The backward function receives the gradient of some scalar value with respect to the output Tensors, and computes the gradient of that same scalar value with respect to the input Tensors.

In PyTorch we can easily define our own autograd operator by defining a subclass of `torch.autograd.Function` and implementing the `forward` and `backward` functions. We can then use our new autograd operator by calling its `apply` method like a function, passing Tensors containing input data.

In this example we define our own custom autograd function for performing the `ReLU` nonlinearity, and use it to implement our two-layer network:

In [1]:
# -*- coding: utf-8 -*-
import torch

class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache Tensors
        for use in the backward pass using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # To apply our Function, we use the Function.apply method. We alias this as 'relu'.
    # Note that this is not torch.nn.Module.apply, which applies a function
    # recursively to every submodule of a Module.
    relu = MyReLU.apply

    # Forward pass: compute predicted y using operations; we compute
    # ReLU using our custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()


0 26256556.0
1 19328096.0
2 16393646.0
3 14683070.0
4 13093412.0
5 11164629.0
6 9011899.0
7 6853600.0
8 4996895.0
9 3534841.0
10 2481497.75
11 1749664.75
12 1256624.625
13 924579.375
14 700368.6875
15 545851.375
16 437030.625
17 358030.53125
18 299017.40625
19 253686.8125
20 217864.6875
21 188912.953125
22 165098.5
23 145177.265625
24 128301.90625
25 113834.3671875
26 101342.0078125
27 90495.8671875
28 81011.953125
29 72678.7421875
30 65330.9453125
31 58834.52734375
32 53072.71484375
33 47950.67578125
34 43386.0
35 39308.43359375
36 35660.796875
37 32388.84765625
38 29449.255859375
39 26804.515625
40 24423.724609375
41 22274.822265625
42 20333.00390625
43 18576.109375
44 16984.82421875
45 15541.6220703125
46 14231.6923828125
47 13041.642578125
48 11959.16015625
49 10973.828125
50 10075.7099609375
51 9256.6806640625
52 8509.1611328125
53 7826.45458984375
54 7202.6083984375
55 6631.9541015625
56 6109.5859375
57 5631.09521484375
58 5192.53515625
59 4790.3896484375
60 4421.08056640625
61 4

427 2.9477472708094865e-05
428 2.910594230343122e-05
429 2.8616586860152893e-05
430 2.8439231755328365e-05
431 2.807174860208761e-05
432 2.7707241315511055e-05
433 2.7359947125660256e-05
434 2.6997251552529633e-05
435 2.6631314540281892e-05
436 2.6248919311910868e-05
437 2.5720162739162333e-05
438 2.547097028582357e-05
439 2.514050720492378e-05
440 2.4945600671344437e-05
441 2.4662762371008284e-05
442 2.4386825316469185e-05
443 2.4057697373791598e-05
444 2.3829825295251794e-05
445 2.3482607502955943e-05
446 2.3088441594154574e-05
447 2.2729611373506486e-05
448 2.2406637071981095e-05
449 2.212467006756924e-05
450 2.1881833163206466e-05
451 2.1694058887078427e-05
452 2.1461215510498732e-05
453 2.115232200594619e-05
454 2.0871411834377795e-05
455 2.0754992874572054e-05
456 2.0472032701945864e-05
457 2.030149153142702e-05
458 2.0120372937526554e-05
459 1.980361957976129e-05
460 1.983837501029484e-05
461 1.9571418306441046e-05
462 1.942715789482463e-05
463 1.9139137293677777e-05
464 1.89858
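A quick sanity check for a custom Function is to compare its gradients with those of an equivalent built-in operation. The sketch below repeats the `MyReLU` definition so that it runs on its own; the random weighting tensor `c` and the other variable names are assumptions added for illustration.

```python
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input

torch.manual_seed(0)
a = torch.randn(5, 4, requires_grad=True)
# Independent copy of the same values for the built-in path
b = a.detach().clone().requires_grad_(True)
# Random weights so the incoming gradient is non-zero everywhere
c = torch.randn(5, 4)

(MyReLU.apply(a) * c).sum().backward()   # custom Function
(b.clamp(min=0) * c).sum().backward()    # built-in equivalent

grads_match = torch.allclose(a.grad, b.grad)
print(grads_match)
```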

## TensorFlow: Static Graphs

PyTorch autograd looks a lot like TensorFlow: in both frameworks we define a computational graph, and use automatic differentiation to compute gradients. 

**The biggest difference between the two is that TensorFlow’s computational graphs are static and PyTorch uses dynamic computational graphs.**

In TensorFlow, we define the computational graph once and then execute the same graph over and over again, possibly feeding different input data to the graph. In PyTorch, each forward pass defines a new computational graph.

**Static graphs are nice because you can optimize the graph up front**; for example a framework might decide to fuse some graph operations for efficiency, or to come up with a strategy for distributing the graph across many GPUs or many machines. If you are reusing the same graph over and over, then this potentially costly up-front optimization can be amortized as the same graph is rerun over and over.

One aspect where static and dynamic graphs differ is **control flow**. For some models we may wish to perform different computation for each data point; for example a recurrent network might be unrolled for different numbers of time steps for each data point; this unrolling can be implemented as a loop. With a static graph the loop construct needs to be a part of the graph; for this reason TensorFlow provides operators such as `tf.scan` for embedding loops into the graph. With dynamic graphs the situation is simpler: since we build graphs on-the-fly for each example, we can use normal imperative flow control to perform computation that differs for each input.
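To illustrate the dynamic-graph side of this comparison, here is a minimal PyTorch sketch in which an ordinary Python loop decides how many operations end up in the graph. The function `repeated_scale` and its `steps` argument are hypothetical names chosen for this example.

```python
import torch

w = torch.tensor(0.5, requires_grad=True)

def repeated_scale(x, steps):
    # A plain Python loop; each iteration adds one multiply to the graph
    out = x
    for _ in range(steps):
        out = out * w
    return out

# With steps=3 the graph computes 2 * w**3
out = repeated_scale(torch.tensor(2.0), steps=3)
out.backward()

# d/dw of 2 * w**3 is 6 * w**2, which is 1.5 at w = 0.5
print(out.item(), w.grad.item())
```

Calling `repeated_scale` with a different `steps` value simply builds a different graph on that call; no special loop operator is needed.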

To contrast with the PyTorch autograd example above, here we use TensorFlow (its 1.x graph-and-session API, with `tf.placeholder` and `tf.Session`) to fit a simple two-layer net:

In [2]:
# -*- coding: utf-8 -*-
import tensorflow as tf
import numpy as np

# First we set up the computational graph:

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create placeholders for the input and target data; these will be filled
# with real data when we execute the graph.
x = tf.placeholder(tf.float32, shape=(None, D_in))
y = tf.placeholder(tf.float32, shape=(None, D_out))

# Create Variables for the weights and initialize them with random data.
# A TensorFlow Variable persists its value across executions of the graph.
w1 = tf.Variable(tf.random_normal((D_in, H)))
w2 = tf.Variable(tf.random_normal((H, D_out)))

# Forward pass: Compute the predicted y using operations on TensorFlow Tensors.
# Note that this code does not actually perform any numeric operations; it
# merely sets up the computational graph that we will later execute.
h = tf.matmul(x, w1)
h_relu = tf.maximum(h, tf.zeros(1))
y_pred = tf.matmul(h_relu, w2)

# Compute loss using operations on TensorFlow Tensors
loss = tf.reduce_sum((y - y_pred) ** 2.0)

# Compute gradient of the loss with respect to w1 and w2.
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

# Update the weights using gradient descent. To actually update the weights
# we need to evaluate new_w1 and new_w2 when executing the graph. Note that
# in TensorFlow the act of updating the value of the weights is part of
# the computational graph; in PyTorch this happens outside the computational
# graph.
learning_rate = 1e-6
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)

# Now we have built our computational graph, so we enter a TensorFlow session to
# actually execute the graph.
with tf.Session() as sess:
    # Run the graph once to initialize the Variables w1 and w2.
    sess.run(tf.global_variables_initializer())

    # Create numpy arrays holding the actual data for the inputs x and targets
    # y
    x_value = np.random.randn(N, D_in)
    y_value = np.random.randn(N, D_out)
    for _ in range(500):
        # Execute the graph many times. Each time it executes we want to bind
        # x_value to x and y_value to y, specified with the feed_dict argument.
        # Each time we execute the graph we want to compute the values for loss,
        # new_w1, and new_w2; the values of these Tensors are returned as numpy
        # arrays.
        loss_value, _, _ = sess.run([loss, new_w1, new_w2],
                                    feed_dict={x: x_value, y: y_value})
        print(loss_value)

34268990.0
33199452.0
44628630.0
83193920.0
177809380.0
198283000.0
42722336.0
1610577.2
1214242.5
1004158.2
839246.2
708091.8
602796.1
517194.44
446851.03
388833.9
340379.94
299625.06
265016.03
235527.94
210188.92
188298.19
169307.48
152764.61
138310.27
125606.41
114414.305
104513.69
95721.84
87881.91
80872.91
74589.555
68942.5
63848.395
59246.605
55086.43
51312.402
47876.28
44742.688
41879.383
39257.152
36852.566
34642.805
32604.56
30724.07
28987.545
27381.47
25893.986
24514.004
23232.195
22039.418
20928.11
19891.775
18924.984
18020.453
17173.758
16380.053
15634.887
14934.75
14276.127
13656.137
13071.886
12520.784
12000.418
11508.973
11043.653
10603.113
10185.683
9789.796
9414.067
9057.221
8718.021
8395.4795
8088.4487
7796.2676
7518.19
7252.69
6999.3135
6758.2974
6529.135
6310.039
6100.456
5899.864
5707.825
5523.721
5347.379
5178.1953
5015.7485
4859.7896
4709.9556
4565.9526
4427.54
4294.4014
4166.287
4042.9436
3924.268
3809.899
3699.6448
3593.3345
3490.8184
3391.9343
3296.511
3204.39

# nn module

## PyTorch: nn

Computational graphs and autograd are a very powerful paradigm for defining complex operators and automatically taking derivatives; however, for large neural networks raw autograd can be a bit too low-level.

When building neural networks we frequently think of arranging the computation into layers, some of which have learnable parameters which will be optimized during learning.

In TensorFlow, packages like Keras, TensorFlow-Slim, and TFLearn provide higher-level abstractions over raw computational graphs that are useful for building neural networks.

In PyTorch, the nn package serves this same purpose. The nn package defines a set of Modules, which are roughly equivalent to neural network layers. A Module receives input Tensors and computes output Tensors, but may also hold internal state such as Tensors containing learnable parameters. The nn package also defines a set of useful loss functions that are commonly used when training neural networks.
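As a small sketch of a Module holding internal state, the snippet below inspects the learnable parameters of a single `torch.nn.Linear` layer. The layer sizes and variable names are arbitrary choices for illustration.

```python
import torch

# A Linear Module mapping 4 input features to 3 output features
layer = torch.nn.Linear(4, 3)

# The Module owns its learnable Tensors: a weight matrix and a bias vector
shapes = {name: tuple(p.shape) for name, p in layer.named_parameters()}
all_require_grad = all(p.requires_grad for p in layer.parameters())

# Calling the Module applies it to a batch of 5 inputs
out = layer(torch.randn(5, 4))

print(shapes, all_require_grad, out.shape)
```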

In this example we use the nn package to implement our two-layer network:

In [3]:
# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

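# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function. Passing
# reduction='sum' sums the squared errors instead of averaging them, matching
# the loss used in the earlier examples (it replaces the deprecated
# size_average=False argument).
loss_fn = torch.nn.MSELoss(reduction='sum')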

learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

0 683.2108154296875
1 634.3748779296875
2 591.9528198242188
3 555.0870971679688
4 522.373046875
5 492.73114013671875
6 465.7292785644531
7 440.9951477050781
8 418.2503356933594
9 397.0826110839844
10 377.2153625488281
11 358.3522033691406
12 340.5199279785156
13 323.6549987792969
14 307.6528015136719
15 292.4201965332031
16 277.91278076171875
17 264.048583984375
18 250.73182678222656
19 237.87322998046875
20 225.53953552246094
21 213.6479949951172
22 202.19058227539062
23 191.1853485107422
24 180.62548828125
25 170.4968719482422
26 160.80032348632812
27 151.5240936279297
28 142.65243530273438
29 134.17857360839844
30 126.0836410522461
31 118.3542251586914
32 111.01949310302734
33 104.05113983154297
34 97.4429702758789
35 91.1949691772461
36 85.29071807861328
37 79.72793579101562
38 74.49539947509766
39 69.56654357910156
40 64.93244171142578
41 60.588226318359375
42 56.516319274902344
43 52.70301818847656
44 49.13719177246094
45 45.80743408203125
46 42.7026481628418
47 39.80336380004883

489 4.2133132183153066e-07
490 4.0720198057897505e-07
491 3.936128223358537e-07
492 3.801029606620432e-07
493 3.6751413290403434e-07
494 3.551225518094725e-07
495 3.4313256946916226e-07
496 3.317419441373204e-07
497 3.206062046956504e-07
498 3.097289891229593e-07
499 2.994094359110022e-07
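An earlier code comment notes that `torch.optim.SGD` can perform the weight update instead of the manual loop over parameters. The following is a rough sketch of that variant, not the notebook's own code: the smaller layer sizes, the seed, the learning rate, and the `losses` list are assumptions chosen so the example runs quickly.

```python
import torch

torch.manual_seed(0)

# Smaller hypothetical sizes than in the example above
N, D_in, H, D_out = 16, 20, 10, 2
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# The optimizer holds references to the parameters it should update
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

losses = []
for t in range(200):
    loss = loss_fn(model(x), y)
    losses.append(loss.item())

    optimizer.zero_grad()   # clear gradients accumulated in the previous step
    loss.backward()         # compute gradients for all parameters
    optimizer.step()        # apply param -= lr * param.grad

print(losses[0], losses[-1])
```

Here `optimizer.step()` replaces the explicit `torch.no_grad()` update loop, and `optimizer.zero_grad()` plays the role of `model.zero_grad()`.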
