# Warm-up: numpy

Before introducing PyTorch, we will first implement the network using numpy.

Numpy provides an n-dimensional array object and many functions for manipulating these arrays. Numpy is a generic framework for scientific computing; it does not know anything about computation graphs, deep learning, or gradients. However, we can easily use numpy to fit a two-layer network to random data by manually implementing the forward and backward passes through the network using numpy operations:

In [2]:
# -*- coding: utf-8 -*-
import numpy as np

# Manual backpropagation!!!

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 33604035.24248965
1 34429935.33959008
2 40323255.426083535
3 43828622.5379537
4 37651793.99561229
5 23818217.722445913
6 11419783.48685687
7 4977190.318096895
8 2438917.617055584
9 1485323.6403014986
10 1076694.827247693
11 857251.0286308117
12 712880.7570681137
13 604933.5932219281
14 519207.9972245651
15 448968.83282428596
16 390525.1846320142
17 341376.76094681583
18 299756.1123010445
19 264318.59620949783
20 233901.57025635347
21 207685.35969553352
22 184972.3479305404
23 165195.76598169864
24 147925.12320808112
25 132788.02413390524
26 119465.96934776581
27 107706.173893275
28 97297.91187987139
29 88053.77324523684
30 79822.27835007265
31 72475.73682922109
32 65901.71706986947
33 60020.049096947354
34 54736.86106719043
35 49981.88496717531
36 45692.89850624885
37 41815.51563226028
38 38308.02971222328
39 35128.28577094067
40 32238.904752444007
41 29615.46747264543
42 27227.95237896195
43 25054.857146111608
44 23072.689125661233
45 21262.836594495446
46 19608.27963535999
47 18094

479 6.608357418764179e-07
480 6.289183212556317e-07
481 5.985511937887261e-07
482 5.696622861604171e-07
483 5.421707811431315e-07
484 5.1600444231775e-07
485 4.911099914530159e-07
486 4.6741785551805545e-07
487 4.4487398338571226e-07
488 4.234211003059386e-07
489 4.0300965824325106e-07
490 3.83582207666172e-07
491 3.650946604829654e-07
492 3.4750234005763734e-07
493 3.307600787884032e-07
494 3.148258421220463e-07
495 2.996664306507415e-07
496 2.8523553581235116e-07
497 2.7150425395149565e-07
498 2.584382058737915e-07
499 2.460011107398181e-07


![title](img/my_backprop2.jpg)
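
A hand-written backward pass like the one above is easy to get subtly wrong, so it can be worth comparing it against finite differences. The following is a minimal sketch of such a check, not part of the original notebook: the helper names (`loss_fn`, `analytic_grads`, `numeric_grad`), the tiny layer sizes, and the step size `eps` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the check is repeatable

# Tiny, made-up sizes so the finite-difference loop stays cheap.
N, D_in, H, D_out = 4, 5, 3, 2
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))


def loss_fn(w1, w2):
    """Sum-of-squares loss of the two-layer ReLU network."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()


def analytic_grads(w1, w2):
    """Same backward pass as in the training loop above."""
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    return grad_w1, grad_w2


def numeric_grad(f, w, eps=1e-6):
    """Central-difference estimate of df/dw, perturbing one entry at a time."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = f()
        w[idx] = old - eps
        f_minus = f()
        w[idx] = old  # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad


grad_w1, grad_w2 = analytic_grads(w1, w2)
num_w1 = numeric_grad(lambda: loss_fn(w1, w2), w1)
num_w2 = numeric_grad(lambda: loss_fn(w1, w2), w2)

# Largest absolute disagreement between the two gradient estimates.
max_err_w1 = np.abs(grad_w1 - num_w1).max()
max_err_w2 = np.abs(grad_w2 - num_w2).max()
print(max_err_w1, max_err_w2)
```

Both printed errors should be very small compared with the gradient values themselves. A large error usually points to a mistake in one of the `grad_*` lines.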

## PyTorch: Tensors

Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or greater, so unfortunately numpy won’t be enough for modern deep learning.

Here we introduce the most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors. Like numpy arrays, PyTorch Tensors do not know anything about deep learning or computational graphs or gradients; they are a generic tool for scientific computing.
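
To make the similarity concrete, here is a small sketch that runs the same ReLU-and-sum computation on a numpy array and on a Tensor built from it. The array values are made up for illustration; `torch.from_numpy` is the standard bridge between the two libraries.

```python
import numpy as np
import torch

# A small numpy array with known values: [[0, 1, 2], [3, 4, 5]].
a = np.arange(6, dtype=np.float32).reshape(2, 3)

# torch.from_numpy wraps the same memory in a Tensor (no copy is made).
t = torch.from_numpy(a)
print(t.shape, t.dtype)

# The same computation in both libraries: shift, apply ReLU, sum.
np_result = np.maximum(a - 2.0, 0).sum()
torch_result = (t - 2.0).clamp(min=0).sum().item()
print(np_result, torch_result)  # both are 6.0
```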

However, unlike numpy, PyTorch Tensors can utilize GPUs to accelerate their numeric computations. To run a PyTorch Tensor on GPU, you simply need to specify the correct device when creating it.
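
In current PyTorch releases, placement is controlled through the `device` argument or the `.to()` method. The sketch below shows one common pattern, assuming you want to fall back to the CPU when no GPU is visible; the tensor shapes are arbitrary.

```python
import torch

# Pick the GPU when one is visible to PyTorch, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Create a Tensor directly on the chosen device...
x = torch.randn(3, 4, device=device, dtype=torch.float)

# ...or move an existing CPU Tensor there with .to().
w = torch.randn(4, 2).to(device)

# Both operands must live on the same device for the operation to succeed.
out = x.mm(w)
print(out.device, out.shape)
```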

Here we use PyTorch Tensors to fit a two-layer network to random data. Like the numpy example above, we need to manually implement the forward and backward passes through the network:

In [3]:
# -*- coding: utf-8 -*-
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1) # mm - matrix multiply
    h_relu = h.clamp(min=0) # implementation of the ReLU - if a value is less than 0, then bring it to 0
    y_pred = h_relu.mm(w2) # matrix multiply

    # Compute and print loss
    # Use torch.Tensor.item() to get a Python number from a tensor containing a single value:
    loss = (y_pred - y).pow(2).sum().item()
    print(t, loss)

    # Backprop to compute gradients of loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 36033240.0
1 31648028.0
2 28145876.0
3 22592880.0
4 15793286.0
5 9816277.0
6 5801947.5
7 3492640.75
8 2248613.0
9 1569968.25
10 1176699.5
11 928932.6875
12 758982.5
13 633928.3125
14 537037.5625
15 459626.40625
16 396316.125
17 343775.59375
18 299834.875
19 262647.78125
20 230957.171875
21 203765.234375
22 180326.28125
23 160031.234375
24 142365.96875
25 126968.0390625
26 113495.015625
27 101661.5234375
28 91242.171875
29 82052.21875
30 73919.8125
31 66710.1015625
32 60301.7734375
33 54610.234375
34 49527.36328125
35 44981.359375
36 40907.08203125
37 37248.23828125
38 33959.82421875
39 30999.015625
40 28327.65234375
41 25915.265625
42 23733.099609375
43 21755.373046875
44 19959.095703125
45 18326.982421875
46 16841.97265625
47 15491.4736328125
48 14260.171875
49 13135.7734375
50 12109.0654296875
51 11169.5986328125
52 10309.5537109375
53 9522.072265625
54 8799.09375
55 8135.7529296875
56 7526.65478515625
57 6966.869140625
58 6452.42138671875
59 5978.87060546875
60 5542.40966796875
61

# Autograd

## PyTorch: Tensors and autograd

In the above examples, we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but can quickly get very hairy for large complex networks.

Thankfully, we can use automatic differentiation to automate the computation of backward passes in neural networks. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network will define a computational graph; nodes in the graph will be Tensors, and edges will be functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.

This sounds complicated, but it’s pretty simple to use in practice. Each Tensor represents a node in a computational graph. If `x` is a Tensor that has `x.requires_grad=True`, then after a backward pass `x.grad` is another Tensor holding the gradient of some scalar value with respect to `x`.
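
As a minimal illustration, not taken from the original notebook, the sketch below builds a three-element Tensor, computes the scalar `s = sum(x_i ** 2)`, and lets autograd fill in `x.grad`. The variable names and values are chosen only for this example.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Forward pass builds the graph: s = sum(x_i ** 2).
s = x.pow(2).sum()
print(s.grad_fn is not None)  # True: s remembers the operation that produced it

# Backward pass fills x.grad with ds/dx = 2 * x.
s.backward()
print(x.grad)  # tensor([2., 4., 6.])
```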

Here we use PyTorch Tensors and autograd to implement our two-layer network; now we no longer need to manually implement the backward pass through the network:

In [4]:
# -*- coding: utf-8 -*-
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
# Leaving requires_grad at its default of False indicates that we do not need
# to compute gradients with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors; these
    # are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss using operations on Tensors.
    # Now loss is a 0-dimensional Tensor (shape ()) holding a single value.
    # loss.item() returns that value as a Python number.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()

    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd.
    # An alternative way is to operate on weight.data and weight.grad.data.
    # Recall that tensor.data gives a tensor that shares the storage with
    # tensor, but doesn't track history.
    # You can also use torch.optim.SGD to achieve this.
    #
    # torch.no_grad() - Context-manager that disables the gradient calculation.
    # Disabling gradient calculation is useful for inference, when you are sure
    # that you will not call Tensor.backward(). It will reduce memory
    # consumption for computations that would otherwise have `requires_grad=True`.
    # In this mode, the result of every computation will have `requires_grad=False`,
    # even when the inputs have `requires_grad=True`.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

0 30349034.0
1 26774036.0
2 27398126.0
3 28026024.0
4 25481104.0
5 19725102.0
6 12758246.0
7 7343120.5
8 4022804.25
9 2291578.75
10 1419148.5
11 972351.0
12 725022.0625
13 574163.6875
14 472353.6875
15 397733.15625
16 339850.125
17 293279.375
18 254879.859375
19 222701.09375
20 195484.21875
21 172279.765625
22 152397.515625
23 135267.5
24 120429.6484375
25 107531.7109375
26 96274.0
27 86413.5078125
28 77775.90625
29 70162.171875
30 63447.7265625
31 57491.10546875
32 52193.8515625
33 47471.82421875
34 43250.85546875
35 39470.91015625
36 36078.51953125
37 33027.328125
38 30277.20703125
39 27793.11328125
40 25546.02734375
41 23509.6875
42 21660.9921875
43 19980.173828125
44 18449.609375
45 17053.720703125
46 15778.775390625
47 14612.8876953125
48 13545.2763671875
49 12566.322265625
50 11668.169921875
51 10842.896484375
52 10083.4521484375
53 9383.517578125
54 8738.1328125
55 8142.572265625
56 7592.365234375
57 7083.64697265625
58 6613.09423828125
59 6177.21142578125
60 5773.236328125
61 5

420 0.00030157703440636396
421 0.0002941189450211823
422 0.00028706880402751267
423 0.0002801348455250263
424 0.0002744178636930883
425 0.00026669446378946304
426 0.00026092323241755366
427 0.0002542954753153026
428 0.00024859519908204675
429 0.0002426495630061254
430 0.00023736586445011199
431 0.00023108987079467624
432 0.00022653836640529335
433 0.00022170001466292888
434 0.0002160237927455455
435 0.00021063399617560208
436 0.00020606697944458574
437 0.00020238700381014496
438 0.00019835242710541934
439 0.0001933693274622783
440 0.0001897537731565535
441 0.0001861617638496682
442 0.00018207648827228695
443 0.00017712330736685544
444 0.0001740188163239509
445 0.00017043345724232495
446 0.00016761620645411313
447 0.0001637890818528831
448 0.0001606283476576209
449 0.00015777946100570261
450 0.0001544471742818132
451 0.00015155512664932758
452 0.00014866348647046834
453 0.0001455468445783481
454 0.00014294199354480952
455 0.00014004428521730006
456 0.00013721574214287102
457 0.000134443
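
The comments in the loop above mention `torch.optim.SGD` as an alternative to the manual update. The sketch below takes a single optimizer step and checks it against the hand-written rule `w -= learning_rate * w.grad`. It assumes plain SGD with no momentum or weight decay, and it uses much smaller layer sizes and a different learning rate than the notebook so that it runs quickly.

```python
import torch

torch.manual_seed(0)  # fixed seed so the run is repeatable

# Small, made-up sizes to keep the sketch fast (not the notebook's sizes).
N, D_in, H, D_out = 8, 20, 10, 3
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

# Keep copies of the starting weights to reproduce the manual update later.
w1_start = w1.detach().clone()
w2_start = w2.detach().clone()

learning_rate = 1e-4
optimizer = torch.optim.SGD([w1, w2], lr=learning_rate)

# One optimization step.
optimizer.zero_grad()                  # clear any previously accumulated gradients
y_pred = x.mm(w1).clamp(min=0).mm(w2)
loss = (y_pred - y).pow(2).sum()
loss.backward()                        # fills w1.grad and w2.grad
optimizer.step()                       # updates w1 and w2 in place

# Plain SGD should agree with the manual rule used in the loop above.
manual_w1 = w1_start - learning_rate * w1.grad
manual_w2 = w2_start - learning_rate * w2.grad
sgd_matches_manual = torch.allclose(w1, manual_w1) and torch.allclose(w2, manual_w2)
print(sgd_matches_manual)
```

Calling `optimizer.zero_grad()` plays the same role as the explicit `w1.grad.zero_()` and `w2.grad.zero_()` calls: `backward()` adds to existing gradients rather than overwriting them, so they must be cleared between iterations.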

## PyTorch: Defining new autograd functions

Under the hood, each primitive autograd operator is really two functions that operate on Tensors. The forward function computes output Tensors from input Tensors. The backward function receives the gradient of some scalar value with respect to the output Tensors, and computes the gradient of that same scalar value with respect to the input Tensors.

In PyTorch we can easily define our own autograd operator by defining a subclass of `torch.autograd.Function` and implementing the `forward` and `backward` functions as static methods. We can then use our new autograd operator by calling its `apply` method like a function, passing Tensors containing input data.

In this example we define our own custom autograd function for performing the `ReLU` nonlinearity, and use it to implement our two-layer network:

In [None]:
# -*- coding: utf-8 -*-
import torch

class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. Tensors needed in the
        backward pass are cached using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # To apply our Function, we use Function.apply method. We alias this as 'relu'.
    relu = MyReLU.apply

    # Forward pass: compute predicted y using operations; we compute
    # ReLU using our custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()
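
A custom Function's `backward` can be compared against numerical gradients with `torch.autograd.gradcheck`. The sketch below redefines a compact copy of `MyReLU` so that it is self-contained. The input values and tolerances are assumptions chosen for illustration, with the inputs deliberately kept away from 0, where ReLU is not differentiable.

```python
import torch


class MyReLU(torch.autograd.Function):
    """Compact copy of the ReLU Function defined above."""

    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


# gradcheck expects double-precision inputs that require gradients.
inputs = torch.tensor([-2.0, -0.5, 0.5, 3.0], dtype=torch.double, requires_grad=True)

# Returns True when the analytic and numerical gradients agree within
# tolerance, and raises an error otherwise.
gradcheck_ok = torch.autograd.gradcheck(MyReLU.apply, (inputs,), eps=1e-6, atol=1e-4)
print(gradcheck_ok)
```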
