# Introduction to Deep Learning with PyTorch

In this notebook, you'll be introduced to [PyTorch](http://pytorch.org/), a framework for building and training neural networks. In many ways PyTorch tensors behave like the arrays you love from Numpy; Numpy arrays, after all, are just tensors. PyTorch makes it simple to move these tensors to GPUs for the faster processing needed when training neural networks. It also provides a module that automatically calculates gradients (for backpropagation!) and another module specifically for building neural networks. Altogether, PyTorch ends up being more coherent with Python and the Numpy/Scipy stack than TensorFlow and other frameworks.



<img src="assets/andrej.png" width=700px>

## Neural Networks

Deep Learning is based on artificial neural networks which have been around in some form since the late 1950s. The networks are built from individual parts approximating neurons, typically called units or simply "neurons." Each unit has some number of weighted inputs. These weighted inputs are summed together (a linear combination) then passed through an activation function to get the unit's output.

<img src="assets/simple_neuron.png" width=400px>

Mathematically this looks like: 

$$
\begin{align}
y &= f(w_1 x_1 + w_2 x_2 + b) \\
y &= f\left(\sum_i w_i x_i +b \right)
\end{align}
$$

With vectors this is the dot/inner product of two vectors:

$$
h = \begin{bmatrix}
x_1 \, x_2 \cdots  x_n
\end{bmatrix}
\cdot 
\begin{bmatrix}
           w_1 \\
           w_2 \\
           \vdots \\
           w_n
\end{bmatrix}
$$

## Tensors

It turns out neural network computations are just a bunch of linear algebra operations on *tensors*, a generalization of matrices. A vector is a 1-dimensional tensor, a matrix is a 2-dimensional tensor, and an array with three indices is a 3-dimensional tensor (an RGB color image, for example). The tensor is the fundamental data structure for neural networks, and PyTorch (like pretty much every other deep learning framework) is built around it.

<img src="assets/tensor_examples.svg" width=600px>
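
As a quick illustration of these dimensions, here is a minimal sketch that builds a vector, a matrix, and a 3-dimensional tensor in PyTorch and inspects their shapes. The variable names and sizes (for example the 3-channel 4x4 "image") are made up for illustration.

```python
import torch

vector = torch.tensor([1.0, 2.0, 3.0])   # 1-dimensional tensor
matrix = torch.zeros(2, 3)                # 2-dimensional tensor: 2 rows, 3 columns
image = torch.rand(3, 4, 4)               # 3-dimensional tensor: channels x height x width

for name, t in [("vector", vector), ("matrix", matrix), ("image", image)]:
    # .ndim is the number of dimensions, .shape the size along each one
    print(name, t.ndim, tuple(t.shape))
```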

With the basics covered, it's time to explore how we can use PyTorch to build a simple neural network.

In [2]:
import torch
def activation(x):
    """ Sigmoid activation function 
    
        Arguments
        ---------
        x: torch.Tensor
    """
    return 1/(1+torch.exp(-x))
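
PyTorch also ships a built-in `torch.sigmoid`. As a sanity check, the sketch below compares it with a hand-written version equivalent to `activation` above; the test values are arbitrary.

```python
import torch

def activation(x):
    """Sigmoid written out by hand, as in the cell above."""
    return 1 / (1 + torch.exp(-x))

x = torch.tensor([-2.0, 0.0, 2.0])
print(activation(x))
print(torch.sigmoid(x))
# The two versions agree up to floating-point rounding
print(torch.allclose(activation(x), torch.sigmoid(x)))
```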

In [3]:
### Generate some data
torch.manual_seed(7) # Set the random seed so things are predictable

# Features are 5 random normal variables
features = torch.randn((1, 5))
# True weights for our data, random normal variables again
weights = torch.randn_like(features)
# and a true bias term
bias = torch.randn((1, 1))
print(features)
weights

tensor([[-0.1468,  0.7861,  0.9468, -1.1143,  1.6908]])


tensor([[-0.8948, -0.3556,  1.2324,  0.1382, -1.6822]])

Above I generated data we can use to get the output of our simple network. This is all just random for now; going forward we'll start using real data. Going through each relevant line:

`features = torch.randn((1, 5))` creates a tensor with shape `(1, 5)`, one row and five columns, that contains values randomly distributed according to the normal distribution with a mean of zero and standard deviation of one.

`weights = torch.randn_like(features)` creates another tensor with the same shape as `features`, again containing values from a normal distribution.

Finally, `bias = torch.randn((1, 1))` creates a single value from a normal distribution.

PyTorch tensors can be added, multiplied, subtracted, etc, just like Numpy arrays. In general, you'll use PyTorch tensors pretty much the same way you'd use Numpy arrays. They come with some nice benefits though such as GPU acceleration which we'll get to later. For now, use the generated data to calculate the output of this simple single layer network.
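
To make the comparison with Numpy concrete, here is a small sketch of elementwise arithmetic on tensors, plus conversion to and from Numpy arrays with `torch.from_numpy()` and `.numpy()`. The values are arbitrary.

```python
import numpy as np
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([10.0, 20.0, 30.0])

print(a + b)   # elementwise addition
print(a * b)   # elementwise multiplication
print(b - a)   # elementwise subtraction

# Converting between Numpy arrays and tensors
arr = np.array([1.0, 2.0, 3.0])
t = torch.from_numpy(arr)   # shares memory with arr
back = t.numpy()            # shares memory with t (CPU tensors only)

# Because memory is shared, an in-place change to the tensor shows up in the array
t.mul_(2)
print(arr)
```

The shared memory is worth keeping in mind: changing the tensor in place also changes the array it came from.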

# Exercise

Calculate the output of the network with input features `features`, weights `weights`, and bias `bias`. Similar to Numpy, PyTorch has a `torch.sum()` function, as well as a `.sum()` method on tensors, for taking sums. Use the function `activation` defined above as the activation function.

In [16]:
### Solution

# Now, make our labels from our data and true weights

y = activation(torch.sum(features * weights) + bias)
print(y)
y = activation((features * weights).sum() + bias)
print(y)

tensor([[0.1595]])
tensor([[0.1595]])


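A much better way is to use matrix multiplication, which performs the multiply-and-sum in a single, more efficient operation. The weights are reshaped to a (5, 1) column so that the shapes line up: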

In [5]:
## Solution
y = activation(torch.mm(features, weights.view(5,1)) + bias)
print(y)

tensor([[0.1595]])
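
If the reshape is left out, `torch.mm` refuses to multiply two (1, 5) tensors, because the number of columns of the first must equal the number of rows of the second. Here is a minimal sketch of that failure and of three ways to get the (5, 1) shape. The seed and shapes mirror the cells above, and the names `w_view`, `w_reshape`, and `w_t` are only for illustration.

```python
import torch

torch.manual_seed(7)
features = torch.randn((1, 5))
weights = torch.randn_like(features)

try:
    torch.mm(features, weights)          # (1, 5) x (1, 5): inner sizes 5 and 1 differ
except RuntimeError as err:
    print("mm failed:", type(err).__name__)

# Three ways to get a (5, 1) column of weights
w_view = weights.view(5, 1)              # new view on the same memory
w_reshape = weights.reshape(5, 1)        # like view, but copies if it has to
w_t = weights.t()                        # transpose of a 2-D tensor

out = torch.mm(features, w_view)
print(out.shape)
```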


## Now just stack them up!

That's how you can calculate the output for a single neuron. The real power of this algorithm happens when you start stacking these individual units into layers and stacks of layers, into a network of neurons. The output of one layer of neurons becomes the input for the next layer. With multiple input units and output units, we now need to express the weights as a matrix.

<img src='assets/multilayer_diagram_weights.png' width=450px>

The first layer, shown on the bottom here, holds the inputs and is understandably called the **input layer**. The middle layer is called the **hidden layer**, and the final layer (on the right) is the **output layer**. We can express this network mathematically with matrices again and use matrix multiplication to get linear combinations for each unit in one operation. For example, the hidden layer ($h_1$ and $h_2$ here) can be calculated as

$$
\vec{h} = [h_1 \, h_2] = 
\begin{bmatrix}
x_1 \, x_2 \cdots \, x_n
\end{bmatrix}
\cdot 
\begin{bmatrix}
           w_{11} & w_{12} \\
           w_{21} &w_{22} \\
           \vdots &\vdots \\
           w_{n1} &w_{n2}
\end{bmatrix}
$$

The output for this small network is found by treating the hidden layer as inputs for the output unit. Leaving out the bias terms for brevity, the network output is expressed simply as

$$
y =  f_2 \! \left(\, f_1 \! \left(\vec{x} \, \mathbf{W_1}\right) \mathbf{W_2} \right)
$$

In [6]:
### Generate some data
torch.manual_seed(7) # Set the random seed so things are predictable

# Features are 3 random normal variables
features = torch.randn((1, 3))

# Define the size of each layer in our network
n_input = features.shape[1]     # Number of input units, must match number of input features
n_hidden = 2                    # Number of hidden units 
n_output = 1                    # Number of output units

# Weights for inputs to hidden layer
W1 = torch.randn(n_input, n_hidden)
# Weights for hidden layer to output layer
W2 = torch.randn(n_hidden, n_output)

# and bias terms for hidden and output layers
B1 = torch.randn((1, n_hidden))
B2 = torch.randn((1, n_output))

# Exercise
Calculate the output for this multi-layer network using the weights `W1` and `W2` and the biases `B1` and `B2`.

In [19]:
h = activation(torch.mm(features, W1) + B1)
output = activation(torch.mm(h, W2) + B2)
print(output)

tensor([[0.3171]])
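
To see why one matrix multiplication covers every hidden unit at once, the sketch below checks that column j of `W1` produces hidden unit j. It regenerates tensors with the same shapes as above (not necessarily the same values), and the helper name `sigmoid` is just a local stand-in for `activation`.

```python
import torch

def sigmoid(x):
    return 1 / (1 + torch.exp(-x))

torch.manual_seed(7)
features = torch.randn((1, 3))
W1 = torch.randn(3, 2)
B1 = torch.randn((1, 2))

# All hidden units in one matrix multiplication
h = sigmoid(torch.mm(features, W1) + B1)

# The same values, one unit at a time: dot product with one column of W1
h_units = [sigmoid((features[0] * W1[:, j]).sum() + B1[0, j]) for j in range(2)]
h_manual = torch.stack(h_units).view(1, 2)

print(h)
print(h_manual)
```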


# Warm Up: NN using Numpy

Before going further with PyTorch, we will first implement a two-layer network using Numpy.

Numpy provides an n-dimensional array object, and many functions for manipulating these arrays. Numpy is a generic framework for scientific computing; it does not know anything about computation graphs, or deep learning, or gradients. However, we can easily use Numpy to fit a two-layer network to random data by manually implementing the forward and backward passes through the network using Numpy operations:

In [7]:
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(1):  # a single step here; increase the range to train longer
  # Forward pass: compute predicted y, printing shapes along the way
  print(*x.shape)        # input: (N, D_in)
  print(*w1.shape)       # first-layer weights: (D_in, H)
  h = x.dot(w1)
  print(*h.shape)        # hidden pre-activations: (N, H)
  h_relu = np.maximum(h, 0)
  y_pred = h_relu.dot(w2)
  print(*y_pred.shape)   # predictions: (N, D_out)

  # Compute and print loss
  loss = np.square(y_pred - y).sum()
  print(t, loss)
  
  # Backprop to compute gradients of w1 and w2 with respect to loss
  grad_y_pred = 2.0 * (y_pred - y)
  grad_w2 = h_relu.T.dot(grad_y_pred)
  grad_h_relu = grad_y_pred.dot(w2.T)
  grad_h = grad_h_relu.copy()
  grad_h[h < 0] = 0
  grad_w1 = x.T.dot(grad_h)
 
  # Update weights
  w1 -= learning_rate * grad_w1
  w2 -= learning_rate * grad_w2

64 1000
1000 100
64 100
64 10
0 32906400.232589435
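
One way to gain confidence in a hand-written backward pass like the one above is a finite-difference gradient check. The sketch below is not part of the original notebook: it uses much smaller, arbitrarily chosen layer sizes, a fixed seed, and hypothetical helper names (`loss_fn`, `numeric_grad`) to compare the analytic gradients for `w1` and `w2` with numerical estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2          # small sizes keep the check fast
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

def loss_fn(w1, w2):
    """Sum-of-squares loss of the two-layer ReLU network."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()

# Analytic gradients, same formulas as the training loop above
h = x.dot(w1)
h_relu = np.maximum(h, 0)
grad_y_pred = 2.0 * (h_relu.dot(w2) - y)
grad_w2 = h_relu.T.dot(grad_y_pred)
grad_h = grad_y_pred.dot(w2.T)
grad_h[h < 0] = 0
grad_w1 = x.T.dot(grad_h)

def numeric_grad(f, w, eps=1e-6):
    """Central-difference estimate of df/dw, one entry at a time."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = f()
        w[idx] = old - eps
        f_minus = f()
        w[idx] = old                     # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

num_w1 = numeric_grad(lambda: loss_fn(w1, w2), w1)
num_w2 = numeric_grad(lambda: loss_fn(w1, w2), w2)
print(np.max(np.abs(num_w1 - grad_w1)), np.max(np.abs(num_w2 - grad_w2)))
```

Both printed differences should be tiny compared with the gradients themselves; a large difference would point to a mistake in the backward pass.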


# PyTorch: NN using Tensors

Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or greater, so unfortunately numpy won't be enough for modern deep learning.

Here we introduce the most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors. Any computation you might want to perform with numpy can also be accomplished with PyTorch Tensors; you should think of them as a generic tool for scientific computing.

However, unlike Numpy arrays, PyTorch Tensors can utilize GPUs to accelerate their numeric computations. To place a Tensor on a GPU, pass the `device` argument when constructing it.
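
A common pattern, sketched here with arbitrary shapes, is to pick the device once with `torch.cuda.is_available()` and then create or move tensors with it. The code falls back to the CPU when no GPU is present.

```python
import torch

# Use the GPU when one is available, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(2, 3, device=device)   # created directly on the chosen device
w = torch.randn(3, 4).to(device)       # created on the CPU, then moved

y = x.mm(w)                            # both operands must be on the same device
print(y.device, y.shape)
```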

Here we use PyTorch Tensors to fit a two-layer network to random data. Like the numpy example above we manually implement the forward and backward passes through the network, using operations on PyTorch Tensors:

In [8]:
import torch

device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device)
w2 = torch.randn(H, D_out, device=device)

learning_rate = 1e-6
for t in range(500):
  # Forward pass: compute predicted y
  h = x.mm(w1)
  h_relu = h.clamp(min=0)
  y_pred = h_relu.mm(w2)

  # Compute and print loss; loss is a scalar, and is stored in a PyTorch Tensor
  # of shape (); we can get its value as a Python number with loss.item().
  loss = (y_pred - y).pow(2).sum()
  print(t, loss.item())

  # Backprop to compute gradients of w1 and w2 with respect to loss
  grad_y_pred = 2.0 * (y_pred - y)
  grad_w2 = h_relu.t().mm(grad_y_pred)
  grad_h_relu = grad_y_pred.mm(w2.t())
  grad_h = grad_h_relu.clone()
  grad_h[h < 0] = 0
  grad_w1 = x.t().mm(grad_h)

  # Update weights using gradient descent
  w1 -= learning_rate * grad_w1
  w2 -= learning_rate * grad_w2

0 31479762.0
1 28913350.0
2 29243552.0
3 27895362.0
4 23107436.0
5 15990101.0
6 9593524.0
7 5352017.5
8 3037138.75
9 1860742.0
10 1259786.875
11 930937.125
12 732243.25
13 599410.5
14 502805.65625
15 428234.1875
16 368409.4375
17 319182.5
18 278019.71875
19 243234.453125
20 213615.03125
21 188273.046875
22 166470.734375
23 147623.84375
24 131275.109375
25 117030.7265625
26 104576.8828125
27 93658.5703125
28 84056.0234375
29 75581.828125
30 68084.046875
31 61440.05859375
32 55534.92578125
33 50277.61328125
34 45588.796875
35 41393.69921875
36 37631.4765625
37 34253.33203125
38 31217.01171875
39 28484.796875
40 26017.4921875
41 23789.0546875
42 21770.765625
43 19941.2109375
44 18280.9375
45 16772.931640625
46 15401.443359375
47 14152.92578125
48 13015.2578125
49 11977.388671875
50 11029.626953125
51 10163.5634765625
52 9370.7041015625
53 8644.6357421875
54 7979.43701171875
55 7370.23876953125
56 6811.2939453125
57 6299.21435546875
58 5828.130859375
59 5394.78857421875
60 4996.0908203125


469 3.5320430470164865e-05
470 3.476472193142399e-05
471 3.4370783396298066e-05
472 3.378089240868576e-05
473 3.342274794704281e-05
474 3.2946394640021026e-05
475 3.23609565384686e-05
476 3.1946576200425625e-05
477 3.1353974918602034e-05
478 3.1048715754877776e-05
479 3.069139347644523e-05
480 3.019629730260931e-05
481 2.9947921575512737e-05
482 2.951656460936647e-05
483 2.9135610020603053e-05
484 2.910097646235954e-05
485 2.8637467039516196e-05
486 2.825950286933221e-05
487 2.789109930745326e-05
488 2.751221472863108e-05
489 2.7171981855644844e-05
490 2.6808334951056167e-05
491 2.649083580763545e-05
492 2.6120926122530364e-05
493 2.586271148175001e-05
494 2.552783553255722e-05
495 2.5130539142992347e-05
496 2.5036015358637087e-05
497 2.4712815502425656e-05
498 2.4380844479310326e-05
499 2.416023926343769e-05


# Computational Graph and Autograd

This topic is discussed in detail later on; the example below is just a first look.

In [9]:
import torch

# Define the leaf nodes (plain tensors replace the deprecated Variable wrapper)
a = torch.tensor([4.0])

# requires_grad=True asks autograd to track gradients for these tensors
weights = [torch.tensor([float(i)], requires_grad=True) for i in (2, 5, 9, 7)]

# unpack the weights for nicer assignment
w1, w2, w3, w4 = weights

b = w1 * a
c = w2 * a
d = w3 * b + w4 * c
L = (10 - d)

L.backward()

for index, weight in enumerate(weights, start=1):
    gradient = weight.grad.item()
    print(f"Gradient of L w.r.t. w{index}: {gradient}")

Gradient of L w.r.t. w1: -36.0
Gradient of L w.r.t. w2: -28.0
Gradient of L w.r.t. w3: -8.0
Gradient of L w.r.t. w4: -20.0
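
The printed values can be checked by hand with the chain rule. With $a = 4$ and $(w_1, w_2, w_3, w_4) = (2, 5, 9, 7)$, the graph computes $b = w_1 a$, $c = w_2 a$, $d = w_3 b + w_4 c$, and $L = 10 - d$, so:

$$
\begin{align}
\frac{\partial L}{\partial w_1} &= -w_3 a = -9 \cdot 4 = -36 \\
\frac{\partial L}{\partial w_2} &= -w_4 a = -7 \cdot 4 = -28 \\
\frac{\partial L}{\partial w_3} &= -b = -w_1 a = -2 \cdot 4 = -8 \\
\frac{\partial L}{\partial w_4} &= -c = -w_2 a = -5 \cdot 4 = -20
\end{align}
$$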


# PyTorch: NN using Autograd

In the above examples, we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but can quickly get very hairy for large complex networks.

Thankfully, we can use automatic differentiation to automate the computation of backward passes in neural networks. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network will define a computational graph; nodes in the graph will be Tensors, and edges will be functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.

This sounds complicated, but it's pretty simple to use in practice. If we want to compute gradients with respect to some Tensor, then we set `requires_grad=True` when constructing that Tensor. Any PyTorch operations on that Tensor will cause a computational graph to be constructed, allowing us to later perform backpropagation through the graph. If `x` is a Tensor with `requires_grad=True`, then after backpropagation `x.grad` will be another Tensor holding the gradient of some scalar value with respect to `x`.
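
As a minimal sketch of this behavior, the example below differentiates a made-up scalar function of a single value; the function and the starting value are arbitrary.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)   # track operations on x
y = x ** 2 + 2 * x                          # scalar built from x

y.backward()                                # computes dy/dx and stores it in x.grad
print(x.grad)                               # dy/dx = 2x + 2, evaluated at x = 3
```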

Sometimes you may wish to prevent PyTorch from building computational graphs when performing certain operations on Tensors with requires_grad=True; for example we usually don't want to backpropagate through the weight update steps when training a neural network. In such scenarios we can use the torch.no_grad() context manager to prevent the construction of a computational graph.
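
A small sketch of the difference, using an arbitrary tensor `w`: the same operation is tracked outside the context manager and untracked inside it.

```python
import torch

w = torch.randn(3, requires_grad=True)

tracked = w * 2                 # recorded in the computational graph
with torch.no_grad():
    untracked = w * 2           # same values, but no graph is recorded

print(tracked.requires_grad, untracked.requires_grad)
```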

Here we use PyTorch Tensors and autograd to implement our two-layer network; now we no longer need to manually implement the backward pass through the network:

In [12]:
import torch

device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Create random Tensors for weights; setting requires_grad=True means that we
# want to compute gradients for these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
  # Forward pass: compute predicted y using operations on Tensors. Since w1 and
  # w2 have requires_grad=True, operations involving these Tensors will cause
  # PyTorch to build a computational graph, allowing automatic computation of
  # gradients. Since we are no longer implementing the backward pass by hand we
  # don't need to keep references to intermediate values.
  y_pred = x.mm(w1).clamp(min=0).mm(w2)
  
  # Compute and print loss. Loss is a Tensor of shape (), and loss.item()
  # is a Python number giving its value.
  loss = (y_pred - y).pow(2).sum()
  print(t, loss.item())

  # Use autograd to compute the backward pass. This call will compute the
  # gradient of loss with respect to all Tensors with requires_grad=True.
  # After this call w1.grad and w2.grad will be Tensors holding the gradient
  # of the loss with respect to w1 and w2 respectively.
  loss.backward()

  # Update weights using gradient descent. For this step we just want to mutate
  # the values of w1 and w2 in-place; we don't want to build up a computational
  # graph for the update steps, so we use the torch.no_grad() context manager
  # to prevent PyTorch from building a computational graph for the updates
  with torch.no_grad():
    w1 -= learning_rate * w1.grad
    w2 -= learning_rate * w2.grad

    # Manually zero the gradients after running the backward pass
    w1.grad.zero_()
    w2.grad.zero_()

0 28461560.0
1 24581592.0
2 25326950.0
3 26960664.0
4 26731064.0
5 22816684.0
6 16503462.0
7 10222507.0
8 5831145.0
9 3273193.5
10 1939987.0
11 1255465.625
12 892390.25
13 684106.375
14 552766.1875
15 461715.75
16 393643.90625
17 339990.78125
18 296197.75
19 259693.625
20 228803.078125
21 202427.609375
22 179759.453125
23 160171.65625
24 143134.515625
25 128253.15625
26 115216.0234375
27 103752.015625
28 93644.7734375
29 84717.6640625
30 76796.7734375
31 69750.953125
32 63467.5
33 57856.00390625
34 52831.71484375
35 48324.32421875
36 44273.53125
37 40624.16796875
38 37329.21875
39 34349.96875
40 31652.16796875
41 29204.92578125
42 26981.228515625
43 24958.677734375
44 23116.654296875
45 21436.1015625
46 19899.953125
47 18493.55859375
48 17204.76171875
49 16022.283203125
50 14937.9365234375
51 13941.734375
52 13024.3720703125
53 12178.15234375
54 11396.8916015625
55 10674.86328125
56 10006.75390625
57 9387.90625
58 8814.1484375
59 8281.537109375
60 7786.8193359375
61 7327.1962890625
62 

421 0.007303833961486816
422 0.007061677519232035
423 0.006836325861513615
424 0.006610485725104809
425 0.006393070798367262
426 0.006182613782584667
427 0.005979666952043772
428 0.005785953253507614
429 0.005599744617938995
430 0.0054175653494894505
431 0.005245325621217489
432 0.005077387671917677
433 0.004912861622869968
434 0.004753958433866501
435 0.004603030160069466
436 0.004459579475224018
437 0.004314121324568987
438 0.00417733658105135
439 0.00404431251809001
440 0.00392084289342165
441 0.003794789081439376
442 0.0036796829663217068
443 0.003560226410627365
444 0.003451234195381403
445 0.0033441849518567324
446 0.0032423967495560646
447 0.003138990607112646
448 0.003044123761355877
449 0.0029521333053708076
450 0.002860385226085782
451 0.002772762905806303
452 0.0026870639994740486
453 0.0026062503457069397
454 0.0025317114777863026
455 0.002457126509398222
456 0.002383872400969267
457 0.0023117694072425365
458 0.00224283616989851
459 0.0021787809673696756
460 0.0021138032898
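
The `zero_()` calls in the loop matter because `backward()` adds to `.grad` rather than overwriting it. Here is a minimal sketch of that accumulation, using a made-up one-element weight:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(3 * w).sum().backward()
first = w.grad.clone()          # gradient after one backward pass

(3 * w).sum().backward()        # second pass adds to the stored gradient
accumulated = w.grad.clone()

w.grad.zero_()                  # reset in place before the next iteration
print(first, accumulated, w.grad)
```

Without the reset, each iteration of the training loop would step along the sum of all gradients computed so far instead of the current one.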

# Don't worry if you didn't understand everything above; it will all be explained much more clearly as we go through the next set of notebooks.