In [None]:
import torch
import torch.nn as nn
from torch.optim import SGD
import numpy as np
import matplotlib.pyplot as plt

Suppose one has data that consists of an independent vector $x_i$ and a dependent vector $y_i$ ($i$ indexes the $i$th value in the data set). For example:
 - $x_i$ is the height of the $i$th person, and $y_i$ is their weight (predict weight using height)

The goal of a neural network is as follows. Define a function $f$, depending on parameters $a$, that makes predictions:
$$\hat{y_i} =f(x_i;a)$$

One wants to make $\hat{y_i}$ (the predictions) and $y_i$ (the true values) as `close as possible` by modifying the values of $a$. What does as close as possible mean? This depends on the task. In general, one defines a `similarity function` (or **Loss** function) $L(y,\hat{y})$. The more similar all the $y_i$s and $\hat{y_i}$s are, the smaller $L$ should be. For example 1 above, this could be as simple as:
$$L(y,\hat{y}) = \sum_i(y_i-\hat{y_i})^2$$
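The sum-of-squares loss above can be checked by hand on a few numbers. The following is a minimal sketch; the values in `y_true` and `y_pred` and the helper name `sum_squared_error` are made up for illustration and are not part of the data set used below.

```python
import torch

# Hypothetical toy values, chosen only for illustration
y_true = torch.tensor([1.0, 5.0, 2.0, 5.0])
y_pred = torch.tensor([1.5, 4.0, 2.0, 6.0])

def sum_squared_error(y, yhat):
    """L(y, yhat) = sum_i (y_i - yhat_i)^2"""
    return torch.sum((y - yhat) ** 2)

# Differences are -0.5, 1, 0, -1, so the squares add up to 0.25 + 1 + 0 + 1
loss = sum_squared_error(y_true, y_pred)

# Identical predictions and targets give the smallest possible loss, zero
perfect = sum_squared_error(y_true, y_true)
```

The closer the predictions are to the true values, the smaller the result, with zero reached only when they agree exactly.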

In [None]:
# independent variable
x = torch.tensor([[6,2],[5,2],[1,3],[7,6]]).float()

# dependent variable (1-D, one target value per row of x)
y = torch.tensor([1,5,2,5]).float()

In [None]:
x

In [None]:
y

* So $x_1 = (6,2)$, $x_2=(5,2)$, ...
* So $y_1 = 1$, $y_2=5$, ...

We want to find a function $f$ that depends on parameters $a$ that lets us get from $x$ to $y$.

**Idea**:
1. First multiply each element in $x$ by an $8 \times 2$ matrix (this is 16 parameters $a_i$) - first layer
2. Then multiply each resulting 8d vector by a $1 \times 8$ matrix (this is 8 more parameters $a_i$) - second layer

Define the first matrix (it takes in a 2d vector and returns an 8d vector)

- IMPORTANT: When the matrix is created, it is initially created with random values.

In [None]:
# so takes in 2 different features (as above e.g. x1 = (6, 2)) and turns that into 8 features
M1 = nn.Linear(2,8,bias=False)
M1

If one passes in a vector $x$ (the dataset) where each element $x_i$ (an instance) is a 2d vector, $M_1$ will apply the same matrix multiplication to each element $x_i$.
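To see that a bias-free `nn.Linear` layer is just a matrix multiplication applied to every row, one can compare its output with an explicit product against its `weight` attribute. This is a sketch under assumptions: the names `x_demo`, `layer`, `out_layer`, and `out_manual` are illustrative, and the seed is fixed only so that the run is repeatable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fix the random initial weights so the run is repeatable

# Same four 2d instances as the data set in this notebook
x_demo = torch.tensor([[6, 2], [5, 2], [1, 3], [7, 6]]).float()
layer = nn.Linear(2, 8, bias=False)

# layer.weight has shape (8, 2); every row of x_demo is multiplied by the same matrix
out_layer = layer(x_demo)
out_manual = x_demo @ layer.weight.T

same = torch.allclose(out_layer, out_manual)
```

Both results have shape `(4, 8)`: four instances, each mapped to an 8d vector.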

In [None]:
# each of the 4 instances, originally a 2d vector, is now an 8d vector
M1(x)

In [None]:
# this takes in an 8d vector as above and returns a 1d vector
M2 = nn.Linear(8,1,bias=False)
M2

In [None]:
# now we chain them together
M2(M1(x))

In [None]:
# the output is 2-dimensional with a trailing dimension of size 1, but we need its shape to match y
M2(M1(x)).shape , y

In [None]:
# so we call squeeze() on the output to drop the size-1 dimension and match the shape of y
M2(M1(x)).squeeze()

The weights of the matrices `M1` and `M2` constitute the parameters $a$ of the network defined above. In order to optimize these weights, we first construct our network $f$ as follows:

In [None]:
# nn.Module is the base class for networks and supplies most of the functionality; we only need to define __init__ and forward
class MyNeuralNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.Matrix1 = nn.Linear(2,8,bias=False)
        self.Matrix2 = nn.Linear(8,1,bias=False)
    def forward(self,x):
        x = self.Matrix1(x)
        x = self.Matrix2(x)
        return x.squeeze()

Constructing the network as a subclass of `nn.Module` allows the parameters of the network to be conveniently stored. This will be useful later when we need to adjust them.

In [None]:
f = MyNeuralNet()

In [None]:
# Pass in data to the network.
yhat = f(x)
yhat

In [None]:
y

In [None]:
# nn.Module provides a parameters() method; loop over it to print each weight tensor
for par in f.parameters():
    print(par)
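As a quick sanity check, the stored parameters should match the 16 + 8 = 24 weights counted in the idea above. The sketch below uses a hypothetical stand-in class `TinyNet` that mirrors `MyNeuralNet` so that the block runs on its own; `named_parameters()` and `numel()` are standard PyTorch calls.

```python
import torch.nn as nn

class TinyNet(nn.Module):  # hypothetical stand-in mirroring MyNeuralNet
    def __init__(self):
        super().__init__()
        self.Matrix1 = nn.Linear(2, 8, bias=False)
        self.Matrix2 = nn.Linear(8, 1, bias=False)

    def forward(self, x):
        return self.Matrix2(self.Matrix1(x)).squeeze()

net = TinyNet()

# Map each parameter name to its shape, e.g. 'Matrix1.weight' -> (8, 2)
shapes = {name: tuple(p.shape) for name, p in net.named_parameters()}

# Total number of scalar parameters a_i in the network
n_params = sum(p.numel() for p in net.parameters())
```

Here `shapes` lists one weight tensor per layer, and `n_params` adds up the individual entries of both.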

# Adjusting $a$ so that $\hat{y}$ and $y$ are similar

Now we define the loss function $L$, which provides a metric of similarity between $y$ and $\hat{y}$. In this case, we will use the `mean squared error loss` function:

In [None]:
L = nn.MSELoss()
L(yhat, y)  # nn.MSELoss expects (prediction, target); the value is the same either way

Confirming it is doing the same as the regular mean-squared error:

In [None]:
torch.mean((y-yhat)**2)

Note that $L$ depends on $a$, since our predictions $\hat{y}$ depend on the parameters of the network $a$. In this sense, $L=L(a)$. **The main idea behind machine learning** is to compute the partial derivative of the loss
$$\frac{\partial L}{\partial a_i}$$
for each parameter $a_i$ of the network. Then we adjust each parameter as follows:
$$a_i \to a_i - \ell \frac{\partial L}{\partial a_i}$$
where $\ell$ is the learning rate.

**Example**: A loss function that only depends on one parameter:
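A minimal sketch of such a case, assuming the hypothetical loss $L(a) = (a-3)^2$ with derivative $\partial L/\partial a = 2(a-3)$. The starting value, learning rate, and number of steps are arbitrary choices for illustration.

```python
def loss(a):
    """Hypothetical one-parameter loss with its minimum at a = 3."""
    return (a - 3.0) ** 2

def grad(a):
    """Derivative dL/da of the loss above."""
    return 2.0 * (a - 3.0)

a = 0.0    # arbitrary starting value
lr = 0.1   # learning rate, the symbol ell in the update rule
history = []

for _ in range(100):
    history.append(loss(a))
    a = a - lr * grad(a)  # a -> a - ell * dL/da
```

Each step moves `a` against the sign of the derivative, so the values in `history` shrink as `a` approaches 3.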

The idea is to do this over and over again, until one reaches a minimum for $L$. This is called **gradient descent**

* Each pass of the full data set $x$ is called an **epoch**. In this case, we are evaluating $\partial L/\partial a_i$ on the entire dataset $x$ each time we iterate $a_i \to a_i - \ell \frac{\partial L}{\partial a_i}$, so each iteration corresponds to an epoch.

`SGD` (stochastic gradient descent) is the optimizer that applies this update; it takes in all model parameters $a$ along with the learning rate $\ell$.

In [None]:
opt = SGD(f.parameters(), lr=0.001)
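With default settings (no momentum or weight decay), one `SGD` step should reproduce the update rule $a_i \to a_i - \ell \, \partial L/\partial a_i$ exactly. The sketch below compares a manual update with `opt_demo.step()` on a hypothetical two-parameter loss; the names `a_manual`, `a_sgd`, `toy_loss`, and `opt_demo` are illustrative only.

```python
import torch
from torch.optim import SGD

lr = 0.1

# Two copies of the same hypothetical parameter vector
a_manual = torch.tensor([1.0, -2.0], requires_grad=True)
a_sgd = torch.tensor([1.0, -2.0], requires_grad=True)

def toy_loss(a):
    return torch.sum(a ** 2)  # hypothetical loss; its gradient is 2a

# Manual update: a -> a - lr * dL/da
toy_loss(a_manual).backward()
with torch.no_grad():
    a_manual -= lr * a_manual.grad

# The same update carried out by the optimizer
opt_demo = SGD([a_sgd], lr=lr)
opt_demo.zero_grad()
toy_loss(a_sgd).backward()
opt_demo.step()

match = torch.allclose(a_manual, a_sgd)
```

Both copies end at the same values, which is why the training loop only needs `zero_grad()`, `backward()`, and `step()`.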

Adjust the parameters over and over:

In [None]:
losses = []
for _ in range(50):
    opt.zero_grad() # flush previous epoch's gradient
    loss_value = L(f(x), y) #compute loss - remember f is the model and x is the data
    loss_value.backward() # compute gradient
    opt.step() # Perform iteration using gradient above - this just adjusts all the parameters
    losses.append(loss_value.item())

Plot $L(a)$ as a function of number of iterations

In [None]:
plt.plot(losses);
plt.ylabel(r'Loss $L(y,\hat{y};a)$');  # raw string so that \h is not read as an escape sequence
plt.xlabel('Epochs');

After 50 epochs, this is how close the model $f$ gets to predicting $y$ from $x$:

In [None]:
f(x)

In [None]:
y