# Introduction to PyTorch
_(Requires Python 3, PyTorch 1.0.1, TorchVision 0.2.2)_

Let's start by importing the PyTorch library `torch`:

In [None]:
import torch

We'll also need NumPy and Matplotlib:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## 1. Tensors and Gradients

### 1.1. Tensors
Tensors are similar to NumPy arrays:

In [None]:
a = torch.tensor([[1,2,3],[4,5,6]], dtype=torch.float32)
print(a)

Tensors can be initialized in several ways:

In [None]:
b = torch.ones(2,3)
print(b)

In [None]:
c = torch.zeros(4,4)
print(c)

In [None]:
d = torch.randn(3,3)
print(d)
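
Besides `ones`, `zeros` and `randn`, a few other constructors are often handy. The sketch below is only illustrative; the variable names `r`, `f` and `z` are not part of the notebook.

```python
import torch

# Evenly spaced integer values, similar to numpy.arange
r = torch.arange(0, 6)

# A 2x3 tensor filled with a constant value
f = torch.full((2, 3), 7.0)

# A tensor of zeros with the same shape and dtype as an existing tensor
z = torch.zeros_like(f)

# Every tensor carries its shape and its element type
print(r.shape, r.dtype)
print(f.shape, f.dtype)
print(z.shape, z.dtype)
```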

We can convert NumPy arrays to PyTorch tensors using the `from_numpy` function:

In [None]:
a = np.array([[2,3,4],[1,0,-1]])
t = torch.from_numpy(a)
print(t)

or by passing the `numpy.array` directly to `torch.tensor`, which copies the data:

In [None]:
t = torch.tensor(a)
print(t)
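
The two conversions differ in one practical way: `torch.from_numpy` shares memory with the source array, while `torch.tensor` makes a copy. A minimal sketch of the difference, using the illustrative names `arr`, `shared` and `copied`:

```python
import numpy as np
import torch

arr = np.array([[2, 3, 4], [1, 0, -1]])

shared = torch.from_numpy(arr)   # shares memory with arr
copied = torch.tensor(arr)       # independent copy of the data

# Modify the NumPy array in place
arr[0, 0] = 100

print(shared[0, 0])  # reflects the change made to arr
print(copied[0, 0])  # still holds the original value
```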

We can also do element-wise arithmetic operations with tensors:

In [None]:
a = torch.tensor([[1,0],[0,1]])
b = torch.tensor([[1,-1],[-1,1]])
print(a+b)

In [None]:
print(a-b)

In [None]:
print(a*b)

In [None]:
print(a/b)

And other operations, like matrix multiplication using `torch.mm`:

In [None]:
c = torch.mm(a,b)
print(c)
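
Note that `a*b` and `torch.mm(a,b)` are different operations. The short comparison below reuses the same two matrices; the names `elementwise`, `matrix` and `matrix_op` are introduced here only for illustration.

```python
import torch

a = torch.tensor([[1, 0], [0, 1]])
b = torch.tensor([[1, -1], [-1, 1]])

elementwise = a * b        # multiplies matching entries
matrix = torch.mm(a, b)    # rows of a times columns of b
matrix_op = a @ b          # operator form of matrix multiplication

print(elementwise)
print(matrix)
print(matrix_op)
```

Since `a` is the identity matrix, the matrix product returns `b` unchanged, while the element-wise product keeps only the diagonal of `b`.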

### 1.2. Tensor gradients

By default, PyTorch tensors are treated as constants of a function. We can turn them into variables, for which gradients are tracked, by setting `requires_grad` to `True`.

#### `requires_grad`
By default, `requires_grad` is set to `False`, meaning that the tensor will be interpreted as a constant during differentiation:

In [None]:
t = torch.tensor([[1,1],[2,2]], dtype=torch.float32)
print(t.requires_grad)

The gradient flag is set during initialization, but it can be changed at any time by reassigning the `requires_grad` attribute:

In [None]:
t = torch.tensor([1,2,3,4], dtype=torch.float32, requires_grad=True)
print(t.requires_grad)

In [None]:
t.requires_grad = False
print(t.requires_grad)
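
To see what the flag changes in practice, here is a small sketch with one variable and one constant. It uses `backward()`, which is covered in the optimization section; the function `f` and the tensor `c` are made up for this example.

```python
import torch

# t is a variable of the function, c is a constant
t = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
c = torch.tensor([10.0, 20.0, 30.0])

# f(t) = sum(c * t^2), so df/dt = 2 * c * t
f = torch.sum(c * t ** 2)
f.backward()

print(t.grad)   # gradient with respect to the variable
print(c.grad)   # None: no gradient is tracked for the constant
```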

## 2. Optimization

### 2.1. Fitting a linear model

#### Model description
As an example, we want to fit a linear model to several data points using tensor gradients. We will use stochastic gradient descent to find the parameter values that minimize the error between the modeled line and the data points.

Since we are fitting a linear model, the parameters to optimize are the slope $a$ and the offset $b$ of the linear equation

$$\hat{y}(x) = a x + b$$

Optimizing the model reduces to finding the values of $a$ and $b$ that minimize the mean squared error of the data samples $(x_i,y_i)$: 

$$\min_{a,b} \frac{1}{N}\sum_{i=1}^N (y_i-\hat{y}(x_i))^2$$
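
For reference, the gradients that gradient descent follows can be written out by hand. The derivation below is added for illustration: it writes $L(a,b)$ for the objective above and $\eta$ for the learning rate, two symbols that the original text does not use.

$$L(a,b) = \frac{1}{N}\sum_{i=1}^N (y_i - a x_i - b)^2$$

$$\frac{\partial L}{\partial a} = -\frac{2}{N}\sum_{i=1}^N x_i\,(y_i - a x_i - b), \qquad \frac{\partial L}{\partial b} = -\frac{2}{N}\sum_{i=1}^N (y_i - a x_i - b)$$

$$a \leftarrow a - \eta\,\frac{\partial L}{\partial a}, \qquad b \leftarrow b - \eta\,\frac{\partial L}{\partial b}$$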


#### Data samples
We start by defining our data samples:

In [None]:
x = torch.tensor([1,2,3,4,5,6,7,8], dtype=torch.float32)
y = torch.tensor([0.4,1.3,1.49,2.01,2.31,2.8,3.13,3.67], dtype=torch.float32)

We can view the data points side by side by stacking them into a 2D tensor:

In [None]:
xy = torch.stack((x,y))
print(torch.t(xy))

#### Model parameters
We allocate two tensors, `a` and `b`, as parameters:

In [None]:
a = torch.tensor(0.0, dtype=torch.float32, requires_grad=True)
print(a)

In [None]:
b = torch.tensor(0.0, dtype=torch.float32, requires_grad=True)
print(b)

#### Cost function
The cost function, also known as the loss function, is the target function to optimize. In our case it is the mean squared error:

In [None]:
Loss = 1/len(x)*torch.sum(torch.pow(y - a*x - b, 2))

We can evaluate the current loss obtained with the initial parameters:

In [None]:
print(Loss)
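
The same quantity can be written in a few equivalent ways. The sketch below compares the explicit formula with `torch.mean` and with `torch.nn.functional.mse_loss`; the names `y_hat`, `loss_manual`, `loss_mean` and `loss_builtin` are illustrative only.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8], dtype=torch.float32)
y = torch.tensor([0.4, 1.3, 1.49, 2.01, 2.31, 2.8, 3.13, 3.67], dtype=torch.float32)
a = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

# Model prediction for every sample
y_hat = a * x + b

loss_manual = 1 / len(x) * torch.sum(torch.pow(y - y_hat, 2))
loss_mean = torch.mean((y - y_hat) ** 2)
loss_builtin = F.mse_loss(y_hat, y)   # averages over all elements by default

print(loss_manual, loss_mean, loss_builtin)
```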

#### The optimizer

##### Gradient Updates
The optimizer holds references to the parameters and updates them from their stored gradients according to an algorithm. The update is triggered every time we call the method `step()`. Note that **the optimizer does not compute the gradients**.

##### Gradient Computation
The gradients are computed and stored in the `grad` attribute of the tensors whose `requires_grad` attribute is set to `True`. To compute the gradients, we call `backward()` on a tensor that represents a function of the parameters.

##### Gradient Update Algorithm
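We will be using _Stochastic Gradient Descent_ for optimization. The algorithm is defined in `torch.optim.SGD` and requires the parameters to be optimized and the learning rate. Because every step here uses all the data samples at once, the update is effectively plain (full-batch) gradient descent: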

In [None]:
optimizer = torch.optim.SGD([a,b], lr=0.01)

Because `backward()` accumulates gradients instead of overwriting them, it is important to reset the stored gradients before each gradient computation:

In [None]:
optimizer.zero_grad()

No gradients have been computed yet, so the `grad` attributes are still `None`:

In [None]:
print(a.grad)
print(b.grad)

We want to compute the gradients of the loss with respect to the parameters. To do so, we call `backward()` on `Loss`:

In [None]:
Loss.backward()

The computed gradients are now stored in each tensor's `grad` attribute:

In [None]:
print(a.grad)
print(b.grad)
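
As a sanity check, the gradients from autograd can be compared with the ones obtained by differentiating the mean squared error by hand. This is a minimal sketch; `residual`, `grad_a` and `grad_b` are names introduced for the example.

```python
import torch

x = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8], dtype=torch.float32)
y = torch.tensor([0.4, 1.3, 1.49, 2.01, 2.31, 2.8, 3.13, 3.67], dtype=torch.float32)
a = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

# Gradients computed by autograd
loss = torch.mean((y - a * x - b) ** 2)
loss.backward()

# Analytical gradients of the mean squared error:
#   dL/da = -2 * mean(x * (y - a*x - b))
#   dL/db = -2 * mean(y - a*x - b)
with torch.no_grad():
    residual = y - a * x - b
    grad_a = -2 * torch.mean(x * residual)
    grad_b = -2 * torch.mean(residual)

print(a.grad, grad_a)
print(b.grad, grad_b)
```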

Once the gradients are computed, we perform a gradient descent step:

In [None]:
optimizer.step()

and reset the stored gradients:

In [None]:
optimizer.zero_grad()
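
The reason for resetting is that each call to `backward()` adds to the gradient already stored in the tensor. A toy sketch with a single hypothetical parameter `w` makes this visible:

```python
import torch

w = torch.tensor(3.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

# d(w^2)/dw = 2w, which is 6 at w = 3
(w ** 2).backward()
first = w.grad.item()

# A second backward pass without a reset adds to the stored gradient
(w ** 2).backward()
accumulated = w.grad.item()

# After zero_grad() the stale gradient is cleared
optimizer.zero_grad()
(w ** 2).backward()
fresh = w.grad.item()

print(first, accumulated, fresh)
```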

Let's check how the values of $a$ and $b$ have changed and whether the loss has decreased:

In [None]:
print(a)

In [None]:
print(b)

In [None]:
Loss = 1/len(x)*torch.sum(torch.pow(y - a*x - b, 2))
print(Loss)
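
The single update above can be reproduced by hand: plain SGD (without momentum or weight decay, as configured here) moves each parameter by the learning rate times its gradient, in the opposite direction. The sketch below starts again from fresh parameters; `expected_a` and `expected_b` are illustrative names.

```python
import torch

x = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8], dtype=torch.float32)
y = torch.tensor([0.4, 1.3, 1.49, 2.01, 2.31, 2.8, 3.13, 3.67], dtype=torch.float32)
a = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

lr = 0.01
optimizer = torch.optim.SGD([a, b], lr=lr)

loss = torch.mean((y - a * x - b) ** 2)
loss.backward()

# Value each parameter should take after one step: p - lr * grad
expected_a = a.detach() - lr * a.grad
expected_b = b.detach() - lr * b.grad

optimizer.step()

print(a.item(), expected_a.item())
print(b.item(), expected_b.item())
```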

#### SGD iterations
Each SGD step updates our parameters according to the current gradients and the learning rate. We need several iterations to find the optimal values for $a$ and $b$. Let's iterate the same procedure 10 times to refine the fit of the model:

In [None]:
for i in np.arange(10):
    # Reset gradients
    optimizer.zero_grad()
    
    # Compute loss and gradients
    Loss = 1/len(x)*torch.sum(torch.pow(y - a*x - b, 2))
    Loss.backward()
    
    # Update parameters
    optimizer.step()
    
    # Print the loss computed before this update
    print(Loss)

After 10 steps the loss barely changes between iterations. We can try to refine the fit by taking smaller steps (reducing the learning rate), but it's important to remember that the loss will never reach 0 because the data points do not lie exactly on a straight line.

Let's iterate again with a learning rate of `0.001`:

In [None]:
optimizer.param_groups[0]['lr'] = 1e-3

In [None]:
for i in np.arange(10):
    # Reset gradients
    optimizer.zero_grad()
    
    # Compute loss and gradients
    Loss = 1/len(x)*torch.sum(torch.pow(y-a*x-b, 2))
    Loss.backward()
    
    # Update parameters
    optimizer.step()
    
    # Print the loss computed before this update
    print(Loss)

Indeed, the loss now changes very little from one step to the next.
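
A loss that changes slowly does not by itself tell us whether the parameters have stopped moving. One way to check is to wrap the training loop in a helper and compare runs of different lengths. The function `fit_line` below is a hypothetical helper written for this sketch, not part of PyTorch, and it restarts from $a=0$, $b=0$ on every call.

```python
import torch

def fit_line(x, y, lr=0.01, steps=10):
    """Fit y ~ a*x + b by gradient descent.

    Returns (a, b, loss), where loss is the value computed
    just before the last parameter update.
    """
    a = torch.tensor(0.0, requires_grad=True)
    b = torch.tensor(0.0, requires_grad=True)
    optimizer = torch.optim.SGD([a, b], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = torch.mean((y - a * x - b) ** 2)
        loss.backward()
        optimizer.step()
    return a.item(), b.item(), loss.item()

x = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8], dtype=torch.float32)
y = torch.tensor([0.4, 1.3, 1.49, 2.01, 2.31, 2.8, 3.13, 3.67], dtype=torch.float32)

# Compare short and long runs with the same learning rate
for steps in (10, 100, 5000):
    print(steps, fit_line(x, y, steps=steps))
```

If the parameters still differ noticeably between the short and the long runs, the short run had not fully converged.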

### 2.2. Results

The final Loss is:

In [None]:
print(Loss)

The fitted parameters are:

In [None]:
print(a)

In [None]:
print(b)
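
For a linear model, the least-squares solution can also be computed directly, which gives a reference to compare the fitted values against. The sketch below uses `numpy.polyfit` on the same data; `x_np`, `y_np`, `slope`, `intercept` and `mse` are names chosen for this example.

```python
import numpy as np

x_np = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=np.float64)
y_np = np.array([0.4, 1.3, 1.49, 2.01, 2.31, 2.8, 3.13, 3.67])

# Degree-1 polynomial fit: returns the coefficients [slope, intercept]
slope, intercept = np.polyfit(x_np, y_np, 1)

# Mean squared error of the least-squares line
mse = np.mean((y_np - (slope * x_np + intercept)) ** 2)

print(slope, intercept, mse)
```

The closer $a$ and $b$ are to `slope` and `intercept`, the closer gradient descent has come to the minimum of the loss.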

And this is what our linear model looks like:

In [None]:
plt.scatter(x.numpy(), y.numpy())
plt.plot(np.arange(9), np.arange(9)*a.detach().numpy()+b.detach().numpy(), 'k')
plt.grid()