In [2]:
import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn

from torch.autograd import Variable

# Exercise 0: Basic pytorch

*authors: Asan Agibetov, Georg Dorffner*

For all the exercises create a cell below the definition of the exercise and insert your "code" there. Make sure that it runs and prints out the expected result. 

*Use `Insert -> Insert Cell Below` Jupyter Notebook menu item to insert cells below the currently selected one.*

# Table of contents

1. Tensor manipulation with numpy and pytorch
2. Computational graphs and gradient computation
3. Training a simple Neural Network

## 1.1 Tensor manipulation with numpy

Defines a simple vector $\vec{x_1} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$

In [3]:
x_1 = np.array([1, 2 ,3])
print(x_1)

[1 2 3]


Elementwise operations on vectors $\vec{x_2} = 3 \vec{x_1} = 3 \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$

In [4]:
x_2 = x_1 * 3
print(x_2)

[3 6 9]


You can define matrices with dimensions $m \times n$ or even tensors with arbitrary dimensions $m_1 \times \cdots \times m_n$.

In [5]:
A_1 = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
print(A_1, A_1.shape) # a 3 x 3 matrix

[[1 2 3]
 [2 3 4]
 [4 5 6]] (3, 3)


In [6]:
A_2 = np.array([A_1, A_1, A_1])
print(A_2, A_2.shape) # a 3 x 3 x 3 tensor, i.e., three A_1 matrices stacked together

[[[1 2 3]
  [2 3 4]
  [4 5 6]]

 [[1 2 3]
  [2 3 4]
  [4 5 6]]

 [[1 2 3]
  [2 3 4]
  [4 5 6]]] (3, 3, 3)


### Exercise 1.1

Create a cell below and define a tensor with dimensions $3 \times 3 \times 3$ consisting of transformed matrices $\begin{pmatrix} 3A_1 \\ 2A_1 + A_1 \\ 10A_1 - 2 \end{pmatrix}$. All operations are elementwise, scalars are "broadcast" to all elements in the tensor

```python
np.array([1, 2]) - 1
>>> [0, 1] # i.e., will subtract 1 from all the elements in the Numpy array.
```

In [7]:
T_0 = np.array([3*A_1,2*A_1+A_1,10*A_1-2])
print(T_0)

[[[ 3  6  9]
  [ 6  9 12]
  [12 15 18]]

 [[ 3  6  9]
  [ 6  9 12]
  [12 15 18]]

 [[ 8 18 28]
  [18 28 38]
  [38 48 58]]]


## 1.2 Tensor manipulation with pytorch

PyTorch tensors can be either created manually or from numpy tensors

In [8]:
X_1 = torch.Tensor([[1, 2, 3], [2, 3, 4], [3, 4, 5]]).long()
print(X_1, X_1.size()) # 3 x 3 Matrix


 1  2  3
 2  3  4
 3  4  5
[torch.LongTensor of size 3x3]
 torch.Size([3, 3])


In [9]:
X_2 = torch.from_numpy(A_1)
print(X_1 - X_2) # subtraction of matrices, notice how we create pytorch tensors from numpy arrays


 0  0  0
 0  0  0
-1 -1 -1
[torch.LongTensor of size 3x3]



### Exercise 1.2

Create a pytorch tensor from the numpy tensor that you have created in `Exercise 1.1` and multiply all its elements by 10.

In [10]:
T_torch = 10*torch.from_numpy(T_0)
print(T_torch)


(0 ,.,.) = 
   30   60   90
   60   90  120
  120  150  180

(1 ,.,.) = 
   30   60   90
   60   90  120
  120  150  180

(2 ,.,.) = 
   80  180  280
  180  280  380
  380  480  580
[torch.LongTensor of size 3x3x3]



## 2. Computational graphs and gradient computation

One of the advantages of using pytorch tensors over numpy tensors, is that we can benefit from the `auograd` pytorch package, which computes gradients automatically for us. To demonstrate it, let us first define a multi-variable function

$$
y = f(w, x, b) = wx + b,
$$

where $w, x, b$ are all variables (i.e., $w, b$ are not constants!). We transform the definition of $y$ into its *computational graph* representation with `pytorch` as follows

In [11]:
w = Variable(torch.Tensor([1]), requires_grad=True)
x = Variable(torch.Tensor([2]), requires_grad=True)
b = Variable(torch.Tensor([3]), requires_grad=True)

y = w * x + b

print(y)

Variable containing:
 5
[torch.FloatTensor of size 1]



Note that we use `torch.autograd.Variable` to indicate that $w, x, b$ will vary over different values, and we indicate that we want to compute gradients with respect to these variables (`requires_grad=True`). Above we evaluate y at the point $(1, 2, 3)$ in the $wxb$ plane.

### Exercise 2.1

Evaluate function $y$ at the point $(2, 3, 5)$. Re-use the defined `pytorch` variables $w, x, b$ by passing them tensors that correspond to the point $(2, 3, 5)$.

In [19]:
w = Variable(torch.Tensor([2]), requires_grad=True)
x = Variable(torch.Tensor([3]), requires_grad=True)
b = Variable(torch.Tensor([5]), requires_grad=True)

y = w * x + b
print(y)

Variable containing:
 11
[torch.FloatTensor of size 1]



Having defined the *computational graph* `y` and the variables `w, x, b`, we can now compute the gradient of `f(w, x, b)`

$$
\nabla f = \begin{bmatrix} \frac{\partial f}{\partial w} \\ \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial b} \end{bmatrix} = \begin{bmatrix} x \\ w \\ 1 \end{bmatrix}.
$$

Evaluated at the point $(1, 2, 3)$ the gradient $\nabla f(1, 2, 3)$ is the vector $\begin{bmatrix} 2 \\ 1 \\ 1 \end{bmatrix}$. We use `backward` method to automatically compute the gradients in `pytorch`. 

**You might need to rerun the cell with the definition of the computational graph `y`, if you call `backward` more than once.**

In [20]:
y.backward()

print(w.grad) # partial f / partial w
print(x.grad) # partial f / partial x
print(b.grad) # partial f / partial b

Variable containing:
 3
[torch.FloatTensor of size 1]

Variable containing:
 2
[torch.FloatTensor of size 1]

Variable containing:
 1
[torch.FloatTensor of size 1]



### Exercise 2.2

Define a computational graph for the function $g(w, x, b) = wx^2 + b$. Compute its gradients at the point $(2, 3, 5)$.

Use `x*x` for $x^2$ in pytorch.

In [23]:
g = w*x*x + b
print(g)

Variable containing:
 23
[torch.FloatTensor of size 1]



## 3. Training a simple Neural Network

One of the main ingredient of Neural Networks are hidden layers, which implement simple non-linear functions that transform input into the output. `pytorch` uses a notion of *modules* - that can be found in `torch.nn` package - to compositionally build hidden layers. For instance, below we define a linear layer $wx + b$, where $w$ are weight and $b$ are bias terms, which transforms points in 3 dimensions into the points in 2 dimensions $\mathbb{R}^3 \mapsto \mathbb{R}^2$.

In [35]:
linear = nn.Linear(3, 2)
print(linear.weight)
print(linear.bias)

Parameter containing:
 0.0770  0.5457  0.3865
 0.0746  0.0829  0.0883
[torch.FloatTensor of size 2x3]

Parameter containing:
-0.1329
 0.0894
[torch.FloatTensor of size 2]



You can see that the weight matrix has dimensions $W: 2 \times 3$, such that whenever applied to a vector $v \in \mathbb{R}^3$ it will transform it into a vector in $Wv \in \mathbb{R}^2$. Analogously, $b \in \mathbb{R}^2$, so that $Wx + b \in \mathbb{R}^2$ is possible. To add *non-linearity* we will use a *sigmoid* function

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

in composition with our linear *module* - $\sigma(wx + b)$.

In [36]:
sigmoid = nn.Sigmoid() 
# note that the dimensions will be determined automatically

Let's define a random tensor $X_3$ of dimension $5 \times 3$, containing 5 3-dimensional points $x_i$. And apply our simple non-linear module to transform it into a tensor of dimensions $5 \times 2$, i.e., each 3-dimensional point $x_i$ is going to be transformed into a 2-dimensional point.

*To apply `pytorch` module to tensors, you need to wrape them in `Variable`.*

In [38]:
# torch.randn creates a 5 x 3 tensor with random values in it
X_3 = Variable(torch.randn(5, 3)) 
print("turning")
print(X_3)
# notice how the mathematical composition defined above is translated into the composition of pytorch modules
print("into")
print(sigmoid(linear(X_3)))

turning
Variable containing:
-0.7173  0.7474  0.3101
-1.8280 -1.6012 -0.0346
-0.2416 -0.3556 -1.7494
 0.4262 -0.7406 -1.5214
 0.8109  0.4116 -3.0994
[torch.FloatTensor of size 5x3]

into
Variable containing:
 0.5841  0.5312
 0.2385  0.4545
 0.2647  0.4719
 0.2512  0.4814
 0.2604  0.4776
[torch.FloatTensor of size 5x2]



In practice, you will have to write compositions of modules likes this very often; `pytorch` provides a convenience function `torch.nn.Sequence` which composes a list of modules into one module, i.e., it performs function composition $f = x \circ y \circ z$.

In [39]:
net = nn.Sequential(linear, sigmoid)
net(X_3) # equivalent to running sigmoid(linear(X_3))

Variable containing:
 0.5841  0.5312
 0.2385  0.4545
 0.2647  0.4719
 0.2512  0.4814
 0.2604  0.4776
[torch.FloatTensor of size 5x2]

### Exercise 3.1 

Use `torch.nn.ReLU` module in combination of our previosly defined linear module and apply their composition to the tensor $X_3$.

*Note that you first need to create an instance of the module, and then apply that instance to Variables**

```python
torch.nn.Sigmoid(Variable(torch.randn(5, 3))) # won't work

# instead use instances
sigmoid = torch.nn.Sigmoid()
sigmoid(Variable(torch.randn(5, 3)))
``` 

In [44]:
relu = torch.nn.ReLU()
relu_net = nn.Sequential(linear, relu)
sample_result = relu_net(X_3)
print(sample_result)

Variable containing:
 0.3396  0.1252
 0.0000  0.0000
 0.0000  0.0000
 0.0000  0.0000
 0.0000  0.0000
[torch.FloatTensor of size 5x2]



In the following, we will use our hidden layer with the sigmoid activation function to teach a Neural Network to map vectors in $\mathbb{R}^3$ into vectors in $\mathbb{R}^2$. We will simulate a function $h: \mathbb{R}^3 \mapsto \mathbb{R}^2$ with 1000 random samples for input and output.

In [72]:
INPUT = Variable(torch.randn(1000, 3))
OUTPUT = Variable(torch.randn(1000, 2))

Learning can be decomposed into four main steps:

1. forward propagation
    * getting the currently learnt output of the network - prediction
2. loss error computation
    * measuring how far off is the prediction from the true output
3. gradient computation via backpropagation
    * computing gradients with respect to all the inputs in the network
4. optimization and parameter update
    * updating parameters of the network in the opposite direction of their gradients, so that the loss error is minimized in the next iteration
    
For our example, we will use *mean squared error* as our loss error and *stochastic gradient descent* as our optimization algorithm. We will set the learning rate - how fast we are moving in the opposite direction of the gradient - to 0.01.

In [73]:
lr = 0.01
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=lr)

#### Forward propagation step

We simply apply our network to all the inputs $\sigma(wx_i + b), x_i \in input$. Notice that the `pytorch` modules allow us to perform *vectorized* operations, i.e., perform operations for all inputs and outputs in one run.

In [80]:
prediction = net(INPUT) # transform all 1000 R3 vectors into 1000 R2 vectors
print(prediction)

Variable containing:
 0.5052  0.4709
 0.6324  0.5218
 0.3706  0.4346
       ⋮        
 0.7093  0.5298
 0.3902  0.4712
 0.6473  0.5129
[torch.FloatTensor of size 1000x2]



#### Loss error computation

Here we compute the mean squared error between the predictions for all the inputs $x_i \in input$ and the true outputs $y_i \in output$

$$
L = \frac{1}{n} \sum_i \lVert y_i - f(x_i) \rVert_2 ^2
$$

In [81]:
loss = criterion(prediction, OUTPUT)
print("loss: ", loss.data[0])

loss:  1.1948602199554443


#### Gradient computation via backpropagation

In [82]:
loss.backward()

Notice how the gradients $\frac{\partial L}{\partial w}, \frac{\partial L}{\partial b}$ are "collected" in the parameters - weights and biases - of the modules of the network

In [83]:
linear_module, sigmoid_module = net.children()
print(linear_module.weight.grad)
print(linear_module.bias.grad)

Variable containing:
 0.1438  0.6530  0.3875
 0.1646 -0.0145  0.1030
[torch.FloatTensor of size 2x3]

Variable containing:
 2.2301
 2.7163
[torch.FloatTensor of size 2]



In [84]:
optimizer.step()

`optimizer.step()` is performing automatic update of all the parameters with the gradients (scaled by the learning rate $\nu$). 

$$
\begin{align}
w_i' &:= w_i - \nu \frac{\partial L}{\partial w_i} \\
b_i' &:= b_i - \nu \frac{\partial L}{\partial b_i}
\end{align}
$$

Behind the scenes it is performing updates to the weights and biases of each `pytorch` module. For instance, you could have updated the weights and the biases for the linear module manually as follows:

```python
linear_module.weight.data.sub_(lr * linear_module.weight.grad.data)
linear_module.bias.data.sub_(lr * linear_module.bias.grad.data)
```

In [85]:
prediction_after_one_epoch = net(INPUT)
loss_after_one_epoch = criterion(prediction_after_one_epoch, OUTPUT)
print("loss after 1 epoch of optimization: {}".format(
    loss_after_one_epoch.data[0]))

loss after 1 epoch of optimization: 1.1892273426055908


### Exercise 3.2

Below we showed how the Neural Network can be trained to learn how to map vectors from 3- to 2- dimensional spaces, by performing forward propagation, loss computation and backpropagation for the parameter update. We did all that only once. In practice, Neural Networks are trained for many epochs (thousands), i.e., the above steps are repeated until the `loss` converges (if possible). In this exercise we ask you to train the Neural Network `net` for bigger numbers of epochs, for instance for 100 epochs. Try training for more epochs until the loss converges to values close to zero.

In [None]:
# I renamed all the variables "new_*" just to make sure I wasn't reusing anything from example
# not very elegant but to avoid confusion in this case...
new_linear = nn.Linear(3, 2)
new_sigmoid = nn.Sigmoid() 
new_net = nn.Sequential(new_linear, new_sigmoid)
NEW_INPUT = Variable(torch.randn(1000, 3))
NEW_OUTPUT = Variable(torch.randn(1000, 2))
new_lr = 0.01
new_criterion = nn.MSELoss()
new_optimizer = torch.optim.SGD(new_net.parameters(), lr=new_lr)

def update_net():
    new_prediction = new_net(NEW_INPUT)
    # print("prediction: ", new_prediction)
    new_loss = new_criterion(new_prediction, NEW_OUTPUT)
    # print("is output still the same: ", NEW_OUTPUT[0])
    # print("loss: ", new_loss.data[0])
    new_loss.backward()
    new_linear_module, new_sigmoid_module = new_net.children()
    # print(new_linear_module.weight.grad)
    # print(new_linear_module.bias.grad)
    new_optimizer.step()
    new_prediction_after_one_epoch = new_net(INPUT)
    new_loss_after_one_epoch = new_criterion(new_prediction_after_one_epoch, NEW_OUTPUT)
    # print("loss after 1 epoch of optimization: {}".format(new_loss_after_one_epoch.data[0]))
    return new_loss_after_one_epoch

def train_for_n_steps(train_until_low_loss=False, n_steps=100): 
    # (define arbitrary loss to begin with, we'll update in loop)
    loss = 999
    if train_until_low_loss:
        print("starting open-ended training")
        # update once
        loss = update_net()
        counter = 0
        # then repeat until very low loss found
        while loss.data[0] > 0.05:
            loss = update_net()
            counter += 1
        print("loss after {} steps: {}".format(counter, loss.data[0]))
    else:
        print("starting training for {} steps".format(n_steps))
        # repeat update steps for given number of n_steps
        for i in range(n_steps):
            loss = update_net()
        print("loss after {} steps: {}".format(n_steps, loss.data[0]))
        
    # after either mode, print 
    
# check out loss from fixed 100 steps
train_for_n_steps(False,100)
# now train further until loss close to arbitrary low threshold:
train_for_n_steps(True,0)




starting training for 100 steps
loss after 100 steps: 0.914791464805603
starting open-ended training
