# PyTorch basics - Linear Regression from scratch


Tutorial inspired by [FastAI development notebooks](https://github.com/fastai/fastai_v1/tree/master/dev_nb)

## Machine Learning

<img src="https://i.imgur.com/oJEQe7k.png" width="500">


## Tensors & Gradients

### Import Numpy & PyTorch

In [0]:
import numpy as np
import torch

A tensor is a number, vector, matrix or any n-dimensional array.

### Create tensors (add "requires_grad=True" for variables that need gradients)

In [0]:
x = torch.tensor(3.)
w = torch.tensor(4., requires_grad=True)
b = torch.tensor(5., requires_grad=True)

### Print tensors

In [3]:
print(x)
print(w)
print(b)

tensor(3.)
tensor(4., requires_grad=True)
tensor(5., requires_grad=True)


In [0]:
torch.tensor?

We can combine tensors with the usual arithmetic operations.

### Arithmetic operations

In [6]:
y = w * x + b
print(y)

tensor(17., grad_fn=<AddBackward0>)


What makes PyTorch special is that we can automatically compute the derivative of `y` w.r.t. the tensors that have `requires_grad` set to `True`, i.e. `w` and `b`.

### Compute gradients

In [0]:
y.backward()

### Display gradients

In [8]:
print('dy/dw:', w.grad)
print('dy/db:', b.grad)

dy/dw: tensor(3.)
dy/db: tensor(1.)
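As a quick sanity check, the gradients above can be compared with the derivatives worked out by hand: for `y = w * x + b`, `dy/dw = x` and `dy/db = 1`. The sketch below rebuilds the same three tensors and confirms that autograd agrees. It also shows that `x`, which was created without `requires_grad=True`, gets no gradient.

```python
import torch

x = torch.tensor(3.)
w = torch.tensor(4., requires_grad=True)
b = torch.tensor(5., requires_grad=True)

y = w * x + b
y.backward()

# For y = w * x + b, calculus gives dy/dw = x and dy/db = 1
print(w.grad == x)                 # tensor(True)
print(b.grad == torch.tensor(1.))  # tensor(True)

# x was created without requires_grad=True, so no gradient is stored for it
print(x.grad)                      # None
```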


## Problem Statement

We'll create a model that predicts crop yields for apples and oranges (*target variables*) by looking at the average temperature, rainfall and humidity (*input variables or features*) in a region. Here's the training data:

<img src="https://i.imgur.com/lBguUV9.png" width="500" />

In a **linear regression** model, each target variable is estimated to be a weighted sum of the input variables, offset by some constant, known as a bias:

```
yield_apple  = w11 * temp + w12 * rainfall + w13 * humidity + b1
yield_orange = w21 * temp + w22 * rainfall + w23 * humidity + b2
```

Visually, it means that the yield of apples is a linear or planar function of the temperature, rainfall & humidity.

<img src="https://i.imgur.com/mtkR2lB.png" width="540" >


**Our objective**: Find a suitable set of *weights* and *biases* using the training data, to make accurate predictions.
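To make the weighted-sum formula concrete, here is a small worked example in plain Python. The weights and bias are hypothetical values picked only for illustration (they are not learned from the data); the inputs 73, 67 and 43 are the temperature, rainfall and humidity of the first training observation.

```python
# Hypothetical weights and bias for apples, chosen only for illustration
w11, w12, w13 = 0.5, 0.25, 0.125
b1 = 2.0

# First training observation: temperature, rainfall, humidity
temp, rainfall, humidity = 73, 67, 43

# Weighted sum of the inputs, offset by the bias
yield_apple = w11 * temp + w12 * rainfall + w13 * humidity + b1
print(yield_apple)  # 60.625
```

The actual apple yield for this observation is 56, so these made-up weights are off by a few units. Training is the process of finding weights and biases that shrink such errors across all observations.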

## Training Data
The training data can be represented using 2 matrices (inputs and targets), each with one row per observation and one column per variable.

### Input (temp, rainfall, humidity)

In [0]:
inputs = np.array([[73, 67, 43], 
                   [91, 88, 64], 
                   [87, 134, 58], 
                   [102, 43, 37], 
                   [69, 96, 70]], dtype='float32')

### Targets (apples, oranges)

In [0]:
targets = np.array([[56, 70], 
                    [81, 101], 
                    [119, 133], 
                    [22, 37], 
                    [103, 119]], dtype='float32')

Before we build a model, we need to convert inputs and targets to PyTorch tensors.

### Convert inputs and targets to tensors

In [11]:
inputs = torch.from_numpy(inputs)
targets = torch.from_numpy(targets)
print(inputs)
print(targets)

tensor([[ 73.,  67.,  43.],
        [ 91.,  88.,  64.],
        [ 87., 134.,  58.],
        [102.,  43.,  37.],
        [ 69.,  96.,  70.]])
tensor([[ 56.,  70.],
        [ 81., 101.],
        [119., 133.],
        [ 22.,  37.],
        [103., 119.]])


## Linear Regression Model (from scratch)

The *weights* and *biases* can also be represented as matrices, initialized with random values. The first row of `w` and the first element of `b` are used to predict the first target variable i.e. yield for apples, and similarly the second for oranges.

### Weights and biases

In [16]:
w = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, requires_grad=True)
print(w)
print(b)

tensor([[-0.6014, -2.3387, -0.3545],
        [-0.3795, -0.2801, -0.0066]], requires_grad=True)
tensor([-0.4607, -0.6177], requires_grad=True)


The *model* is simply a function that performs a matrix multiplication of the input `x` and the weights `w` (transposed) and adds the bias `b` (replicated for each observation).

$$
\hspace{2.5cm} X \hspace{1.1cm} \times \hspace{1.2cm} W^T \hspace{1.2cm}  + \hspace{1cm} b \hspace{2cm}
$$

$$
\left[ \begin{array}{ccc}
73 & 67 & 43 \\
91 & 88 & 64 \\
\vdots & \vdots & \vdots \\
69 & 96 & 70
\end{array} \right]
%
\times
%
\left[ \begin{array}{cc}
w_{11} & w_{21} \\
w_{12} & w_{22} \\
w_{13} & w_{23}
\end{array} \right]
%
+
%
\left[ \begin{array}{cc}
b_{1} & b_{2} \\
b_{1} & b_{2} \\
\vdots & \vdots \\
b_{1} & b_{2} \\
\end{array} \right]
$$
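The "replicated for each observation" part of the formula is handled by broadcasting. The sketch below uses its own small tensors (`X_demo`, `W_demo` and `b_demo` are illustrative names, not part of the notebook) to show that adding a length-2 bias vector gives the same result as adding a bias matrix with one copy of the vector per row.

```python
import torch

X_demo = torch.tensor([[73., 67., 43.],
                       [91., 88., 64.]])   # 2 observations, 3 features
W_demo = torch.randn(2, 3)                 # 2 targets, 3 features
b_demo = torch.randn(2)                    # one bias per target

# Broadcasting adds b_demo to every row of X_demo @ W_demo^T
out = X_demo @ W_demo.t() + b_demo

# Same computation with the bias explicitly replicated for each observation
b_rows = b_demo.unsqueeze(0).expand(X_demo.shape[0], -1)
out_explicit = X_demo @ W_demo.t() + b_rows

print(out.shape)                           # torch.Size([2, 2])
print(torch.allclose(out, out_explicit))   # True
```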

### Define the model

In [0]:
def model(x):
    return x @ w.T + b

The matrix obtained by passing the input data to the model is a set of predictions for the target variables.

### Generate predictions

In [18]:
preds = model(inputs)
print(preds)

tensor([[-216.2981,  -47.3733],
        [-283.6804,  -60.2258],
        [-386.7292,  -71.5529],
        [-175.4813,  -51.6164],
        [-291.2872,  -54.1575]], grad_fn=<AddBackward0>)


### Compare with targets

In [19]:
print(targets)

tensor([[ 56.,  70.],
        [ 81., 101.],
        [119., 133.],
        [ 22.,  37.],
        [103., 119.]])


Because we've started with random weights and biases, the model does not do a very good job of predicting the target variables.

## Loss Function

We can compare the predictions with the actual targets, using the following method: 
* Calculate the difference between the two matrices (`preds` and `targets`).
* Square all elements of the difference matrix to remove negative values.
* Calculate the average of the elements in the resulting matrix.

The result is a single number, known as the **mean squared error** (MSE).

### MSE loss

$$\operatorname{MSE}=\frac{1}{n}\sum_{i=1}^n(Y_i-\hat{Y_i})^2.$$

In [0]:
def mse(t1, t2):
    diff = t1 - t2
    return torch.sum(diff ** 2) / diff.numel()
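Before applying `mse` to the real predictions, it can help to trace the three steps on numbers small enough to check by hand. The tensors `small_preds` and `small_targets` below are made up for this purpose.

```python
import torch

def mse(t1, t2):
    diff = t1 - t2
    return torch.sum(diff ** 2) / diff.numel()

small_preds = torch.tensor([[1., 2.], [3., 4.]])
small_targets = torch.tensor([[0., 2.], [3., 6.]])

# Differences: [[1, 0], [0, -2]]
# Squares:     [[1, 0], [0,  4]]
# Average:     (1 + 0 + 0 + 4) / 4 = 1.25
small_loss = mse(small_preds, small_targets)
print(small_loss)  # tensor(1.2500)
```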

### Compute loss

In [24]:
loss = mse(preds, targets)
print(loss)

tensor(77680.9844, grad_fn=<DivBackward0>)


In [26]:
torch.nn.functional.mse_loss(preds,targets)

tensor(77680.9844, grad_fn=<MseLossBackward>)

The resulting number is called the **loss**, because it indicates how bad the model is at predicting the target variables. The lower the loss, the better the model.

## Compute Gradients

With PyTorch, we can automatically compute the gradient or derivative of the `loss` w.r.t. the weights and biases, because they have `requires_grad` set to `True`.

### Compute gradients

In [0]:
loss.backward()

The gradients are stored in the `.grad` property of the respective tensors.

### Gradients for weights

In [28]:
print(w)
print(w.grad)

tensor([[-0.6014, -2.3387, -0.3545],
        [-0.3795, -0.2801, -0.0066]], requires_grad=True)
tensor([[-28882.2070, -32889.3672, -19857.5137],
        [-12404.5283, -13979.1191,  -8525.8809]])


### Gradients for bias

In [29]:
print(b)
print(b.grad)

tensor([-0.4607, -0.6177], requires_grad=True)
tensor([-346.8953, -148.9852])
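For this particular model and loss, the gradients can also be derived by hand, which gives a way to check what autograd computes. With predictions `P = X @ W^T + b` and a loss averaged over all `N` elements of `P`, the gradient w.r.t. the weights is `(2/N) * (P - T)^T @ X` and the gradient w.r.t. the bias is `(2/N)` times the column sums of `P - T`. The sketch below compares both against `.grad`, using its own tensors (`X_chk`, `T_chk`, `W_chk`, `b_chk`) in double precision so the comparison is not affected by rounding.

```python
import torch

torch.manual_seed(0)
X_chk = torch.tensor([[73., 67., 43.],
                      [91., 88., 64.],
                      [87., 134., 58.]], dtype=torch.float64)
T_chk = torch.tensor([[56., 70.],
                      [81., 101.],
                      [119., 133.]], dtype=torch.float64)
W_chk = torch.randn(2, 3, dtype=torch.float64, requires_grad=True)
b_chk = torch.randn(2, dtype=torch.float64, requires_grad=True)

# Forward pass and MSE loss, then let autograd compute the gradients
P_chk = X_chk @ W_chk.t() + b_chk
loss_chk = torch.sum((P_chk - T_chk) ** 2) / P_chk.numel()
loss_chk.backward()

# Hand-derived gradients of the same loss
with torch.no_grad():
    err = P_chk - T_chk
    grad_W = 2 * err.t() @ X_chk / err.numel()
    grad_b = 2 * err.sum(dim=0) / err.numel()

print(torch.allclose(W_chk.grad, grad_W))  # True
print(torch.allclose(b_chk.grad, grad_b))  # True
```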


A key insight from calculus is that the gradient indicates the rate of change of the loss, or the slope of the loss function w.r.t. the weights and biases. 

* If a gradient element is **positive**,
    * **increasing** the element's value slightly will **increase** the loss.
    * **decreasing** the element's value slightly will **decrease** the loss.

<img src="https://i.imgur.com/2H4INoV.png" width="400" />



* If a gradient element is **negative**,
    * **increasing** the element's value slightly will **decrease** the loss.
    * **decreasing** the element's value slightly will **increase** the loss.
    
<img src="https://i.imgur.com/h7E2uAv.png" width="400" />    

The increase or decrease is proportional to the value of the gradient.
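The sign rule can be tried out on a toy loss with a single weight. The function `(3w - 12)^2` used below is a made-up example with its minimum at `w = 4`; starting from `w = 1` the gradient is negative, so nudging `w` upward should lower the loss, and the size of the change should be roughly the gradient times the size of the nudge.

```python
import torch

# Toy loss with a single weight: (3w - 12)^2, smallest at w = 4
def loss_at(value):
    return (value * 3.0 - 12.0) ** 2

w_demo = torch.tensor(1.0, requires_grad=True)
loss_at(w_demo).backward()
print(w_demo.grad)  # tensor(-54.), a negative gradient

step = 0.01
# Negative gradient: increasing w slightly decreases the loss ...
print(loss_at(1.0 + step) < loss_at(1.0))  # True
# ... and decreasing w slightly increases it
print(loss_at(1.0 - step) > loss_at(1.0))  # True

# The change in loss is roughly gradient * step (about -0.54 here)
print(loss_at(1.0 + step) - loss_at(1.0))
print(w_demo.grad.item() * step)
```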

Finally, we'll reset the gradients to zero before moving forward, because PyTorch accumulates gradients.

### Reset Gradients to Zero with "tensor.grad.zero_()" method

In [30]:
w.grad.zero_()
b.grad.zero_()
print(w.grad)
print(b.grad)

tensor([[0., 0., 0.],
        [0., 0., 0.]])
tensor([0., 0.])
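To see why the reset matters, here is a minimal sketch of gradient accumulation on a single scalar tensor (`v` is an illustrative name). Each call to `.backward()` adds to whatever is already stored in `.grad` instead of overwriting it.

```python
import torch

v = torch.tensor(2.0, requires_grad=True)

(v * 3).backward()
print(v.grad)        # tensor(3.)

(v * 3).backward()   # the second backward pass adds to the stored gradient
print(v.grad)        # tensor(6.)

v.grad.zero_()       # in-place reset
print(v.grad)        # tensor(0.)
```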


## Adjust weights and biases using gradient descent

We'll reduce the loss and improve our model using the gradient descent algorithm, which has the following steps:

1. Generate predictions
2. Calculate the loss
3. Compute gradients w.r.t the weights and biases
4. Adjust the weights and biases by subtracting a small quantity proportional to the gradient
5. Reset the gradients to zero

### Generate predictions

In [31]:
preds = model(inputs)
print(preds)

tensor([[-216.2981,  -47.3733],
        [-283.6804,  -60.2258],
        [-386.7292,  -71.5529],
        [-175.4813,  -51.6164],
        [-291.2872,  -54.1575]], grad_fn=<AddBackward0>)


### Calculate the loss

In [32]:
loss = mse(preds, targets)
print(loss)

tensor(77680.9844, grad_fn=<DivBackward0>)


### Compute gradients

In [0]:
loss.backward()

### Adjust weights & reset gradients

$learning\ rate=1e-5$

In [0]:
with torch.no_grad():
    w -= w.grad * 1e-5
    b -= b.grad * 1e-5
    w.grad.zero_()
    b.grad.zero_()

In [35]:
print(w)

tensor([[-0.3125, -2.0098, -0.1559],
        [-0.2555, -0.1403,  0.0786]], requires_grad=True)


With the new weights and biases, the model should have a lower loss.

### Calculate loss

In [36]:
preds = model(inputs)
loss = mse(preds, targets)
print(loss)

tensor(52802.7891, grad_fn=<DivBackward0>)



## Train for multiple epochs

To reduce the loss further, we repeat the process of adjusting the weights and biases using the gradients multiple times. Each iteration is called an epoch.

### Train for 100 epochs

In [0]:
for i in range(100):
    preds = model(inputs)
    loss = mse(preds, targets)
    loss.backward()
    with torch.no_grad():
        w -= w.grad * 1e-5
        b -= b.grad * 1e-5
        w.grad.zero_()
        b.grad.zero_()
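One way to confirm that training is making progress is to record the loss at every epoch. The following is a self-contained sketch of the same loop with that bookkeeping added; the `_demo` names and the `losses` list are introduced here for illustration, and the random seed is fixed only so that repeated runs behave the same way.

```python
import torch

torch.manual_seed(42)
inputs_demo = torch.tensor([[73., 67., 43.], [91., 88., 64.], [87., 134., 58.],
                            [102., 43., 37.], [69., 96., 70.]])
targets_demo = torch.tensor([[56., 70.], [81., 101.], [119., 133.],
                             [22., 37.], [103., 119.]])
w_demo = torch.randn(2, 3, requires_grad=True)
b_demo = torch.randn(2, requires_grad=True)

losses = []
for epoch in range(100):
    preds_demo = inputs_demo @ w_demo.t() + b_demo
    loss_demo = torch.mean((preds_demo - targets_demo) ** 2)
    loss_demo.backward()
    with torch.no_grad():
        w_demo -= w_demo.grad * 1e-5
        b_demo -= b_demo.grad * 1e-5
        w_demo.grad.zero_()
        b_demo.grad.zero_()
    # Store a plain Python float so the list does not keep the graph alive
    losses.append(loss_demo.item())

print(losses[0], losses[-1])   # the last loss should be far below the first
```

Plotting `losses` against the epoch number is a common way to spot a learning rate that is too high (the loss grows or oscillates) or too low (the loss barely moves).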

### Calculate loss

In [38]:
preds = model(inputs)
loss = mse(preds, targets)
print(loss)

tensor(408.3980, grad_fn=<DivBackward0>)


### Print predictions

In [39]:
preds

tensor([[ 65.2448,  73.4773],
        [ 90.1608, 100.3191],
        [ 87.6127, 128.6615],
        [ 68.7270,  55.7729],
        [ 87.8655, 107.4604]], grad_fn=<AddBackward0>)

### Print targets

In [40]:
targets

tensor([[ 56.,  70.],
        [ 81., 101.],
        [119., 133.],
        [ 22.,  37.],
        [103., 119.]])

## Linear Regression Model using PyTorch built-ins

Let's re-implement the same model using some built-in functions and classes from PyTorch.

### Imports

In [0]:
import torch.nn as nn

### Input (temp, rainfall, humidity)

In [0]:
inputs = np.array([[73, 67, 43], [91, 88, 64], [87, 134, 58], [102, 43, 37], [69, 96, 70], [73, 67, 43], [91, 88, 64], [87, 134, 58], [102, 43, 37], [69, 96, 70], [73, 67, 43], [91, 88, 64], [87, 134, 58], [102, 43, 37], [69, 96, 70]], dtype='float32')


### Targets (apples, oranges)

In [0]:
targets = np.array([[56, 70], [81, 101], [119, 133], [22, 37], [103, 119], 
                    [56, 70], [81, 101], [119, 133], [22, 37], [103, 119], 
                    [56, 70], [81, 101], [119, 133], [22, 37], [103, 119]], dtype='float32')

In [0]:
inputs = torch.from_numpy(inputs)
targets = torch.from_numpy(targets)

### Dataset and DataLoader

We'll create a `TensorDataset`, which allows access to rows from `inputs` and `targets` as tuples. We'll also create a `DataLoader`, to split the data into batches while training. It also provides other utilities like shuffling and sampling.

### Import tensor dataset & data loader

In [0]:
from torch.utils.data import TensorDataset, DataLoader

### Define dataset

In [47]:
train_ds = TensorDataset(inputs, targets)
train_ds[0:3]

(tensor([[ 73.,  67.,  43.],
         [ 91.,  88.,  64.],
         [ 87., 134.,  58.]]), tensor([[ 56.,  70.],
         [ 81., 101.],
         [119., 133.]]))

### Define data loader

In [48]:
batch_size = 5
train_dl = DataLoader(train_ds, batch_size, shuffle=True)
next(iter(train_dl))

[tensor([[ 91.,  88.,  64.],
         [ 69.,  96.,  70.],
         [ 69.,  96.,  70.],
         [ 73.,  67.,  43.],
         [102.,  43.,  37.]]), tensor([[ 81., 101.],
         [103., 119.],
         [103., 119.],
         [ 56.,  70.],
         [ 22.,  37.]])]
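A data loader is typically consumed in a `for` loop, where each iteration yields one batch of inputs and targets. The sketch below uses a toy dataset of 15 rows (`xs_demo`, `ys_demo`, `ds_demo` and `dl_demo` are illustrative names) to show that a batch size of 5 produces 3 batches, and that with shuffling every row is still visited exactly once per pass.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

xs_demo = torch.arange(15, dtype=torch.float32).reshape(15, 1)
ys_demo = xs_demo * 2
ds_demo = TensorDataset(xs_demo, ys_demo)
dl_demo = DataLoader(ds_demo, batch_size=5, shuffle=True)

print(len(dl_demo))   # 3 batches: 15 rows / batch size 5

seen = []
for xb, yb in dl_demo:
    print(xb.shape, yb.shape)   # torch.Size([5, 1]) torch.Size([5, 1])
    seen.extend(int(v) for v in xb.flatten())

# Shuffling changes the order, but each row appears once per pass
print(sorted(seen) == list(range(15)))   # True
```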

### nn.Linear
Instead of initializing the weights & biases manually, we can define the model using `nn.Linear`.

### Define model

In [50]:
model = nn.Linear(3, 2, bias=True)
print(model.weight)
print(model.bias)

Parameter containing:
tensor([[ 0.1305, -0.3717,  0.5759],
        [ 0.3917,  0.5323, -0.4330]], requires_grad=True)
Parameter containing:
tensor([-0.0409, -0.0455], requires_grad=True)
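Under the hood, `nn.Linear` stores a weight matrix of shape `(out_features, in_features)` and a bias vector, and applies the same `x @ w.T + b` computation as the from-scratch model. A small check, using a separate layer named `layer_demo` and random inputs:

```python
import torch
import torch.nn as nn

layer_demo = nn.Linear(3, 2)
xs_demo = torch.randn(5, 3)

# Same computation as the hand-written model: x @ w.T + b
manual = xs_demo @ layer_demo.weight.t() + layer_demo.bias

print(layer_demo.weight.shape)   # torch.Size([2, 3])
print(layer_demo.bias.shape)     # torch.Size([2])
print(torch.allclose(layer_demo(xs_demo), manual, atol=1e-6))   # True
```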


### Optimizer
Instead of manually manipulating the weights & biases using gradients, we can use the optimizer `optim.SGD`.

### Define optimizer (SGD-Stochastic Gradient Descent)

In [0]:
opt = torch.optim.SGD(model.parameters(), lr=1e-5)
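With plain SGD (no momentum or weight decay, as configured above), `opt.step()` performs the same update we wrote by hand earlier: each parameter has `lr * gradient` subtracted from it. The sketch below checks this on a separate layer and optimizer (`layer_demo` and `sgd_demo` are illustrative names) with random data.

```python
import torch
import torch.nn as nn

layer_demo = nn.Linear(3, 2)
sgd_demo = torch.optim.SGD(layer_demo.parameters(), lr=1e-5)

xs_demo = torch.randn(4, 3)
ys_demo = torch.randn(4, 2)
loss_demo = torch.mean((layer_demo(xs_demo) - ys_demo) ** 2)
loss_demo.backward()

# What the manual update rule would produce
expected_weight = layer_demo.weight.detach() - 1e-5 * layer_demo.weight.grad
expected_bias = layer_demo.bias.detach() - 1e-5 * layer_demo.bias.grad

sgd_demo.step()        # applies: parameter -= lr * gradient
sgd_demo.zero_grad()   # clears the gradients for the next iteration

print(torch.allclose(layer_demo.weight, expected_weight))   # True
print(torch.allclose(layer_demo.bias, expected_bias))       # True
```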

### Loss Function
Instead of defining a loss function manually, we can use the built-in loss function `mse_loss`.

### Import nn.functional

In [0]:
import torch.nn.functional as F

### Define loss function

In [0]:
loss_fn = F.mse_loss

In [54]:
loss = loss_fn(model(inputs), targets)
print(loss)

tensor(4049.7537, grad_fn=<MseLossBackward>)


### Train the model

We are ready to train the model now. We can define a utility function `fit` which trains the model for a given number of epochs.

### Define a utility function to train the model

In [0]:
def fit(num_epochs, model, loss_fn, opt):
    for epoch in range(num_epochs):
        for xb,yb in train_dl:
            # take batch of items using dataloader
            # Generate predictions
            pred = model(xb)
            loss = loss_fn(pred, yb)
            # Perform gradient descent
            loss.backward()
            opt.step()
            opt.zero_grad()
    print('Training loss: ', loss_fn(model(inputs), targets))

### Train the model for 5000 epochs

In [119]:
fit(5000, model, loss_fn, opt)

Training loss:  tensor(0.6212, grad_fn=<MseLossBackward>)


### Generate predictions

In [120]:
preds = model(inputs)
preds

tensor([[ 57.6378,  70.6759],
        [ 82.2115, 100.6623],
        [118.9086, 133.1091],
        [ 21.1798,  37.0889],
        [102.1731, 119.3260],
        [ 57.6378,  70.6759],
        [ 82.2115, 100.6623],
        [118.9086, 133.1091],
        [ 21.1798,  37.0889],
        [102.1731, 119.3260],
        [ 57.6378,  70.6759],
        [ 82.2115, 100.6623],
        [118.9086, 133.1091],
        [ 21.1798,  37.0889],
        [102.1731, 119.3260]], grad_fn=<AddmmBackward>)

### Compare with targets

In [121]:
targets

tensor([[ 56.,  70.],
        [ 81., 101.],
        [119., 133.],
        [ 22.,  37.],
        [103., 119.],
        [ 56.,  70.],
        [ 81., 101.],
        [119., 133.],
        [ 22.,  37.],
        [103., 119.],
        [ 56.,  70.],
        [ 81., 101.],
        [119., 133.],
        [ 22.,  37.],
        [103., 119.]])

In [122]:
targets - preds 

tensor([[-1.6378, -0.6759],
        [-1.2115,  0.3377],
        [ 0.0914, -0.1091],
        [ 0.8202, -0.0889],
        [ 0.8269, -0.3260],
        [-1.6378, -0.6759],
        [-1.2115,  0.3377],
        [ 0.0914, -0.1091],
        [ 0.8202, -0.0889],
        [ 0.8269, -0.3260],
        [-1.6378, -0.6759],
        [-1.2115,  0.3377],
        [ 0.0914, -0.1091],
        [ 0.8202, -0.0889],
        [ 0.8269, -0.3260]], grad_fn=<SubBackward0>)

In [123]:
(torch.abs(targets - preds)).max()

tensor(1.6378, grad_fn=<MaxBackward1>)

# Bonus: Feedforward Neural Network

![ffnn](https://upload.wikimedia.org/wikipedia/commons/thumb/0/00/Multi-Layer_Neural_Network-Vector-Blank.svg/400px-Multi-Layer_Neural_Network-Vector-Blank.svg.png)

Conceptually, you can think of feedforward neural networks as two or more linear regression models stacked on top of one another with a non-linear activation function applied between them.
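The activation function is what keeps the stack from being just another linear regression. As a rough sketch with two freshly created layers (`lin1` and `lin2` are illustrative names), the code below shows that two linear layers applied back to back, with nothing in between, are equivalent to a single linear layer whose weights and bias can be computed directly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin1 = nn.Linear(3, 3)
lin2 = nn.Linear(3, 2)
xs_demo = torch.randn(5, 3)

# W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)
combined_w = lin2.weight @ lin1.weight
combined_b = lin2.weight @ lin1.bias + lin2.bias

stacked = lin2(lin1(xs_demo))
single = xs_demo @ combined_w.t() + combined_b
print(torch.allclose(stacked, single, atol=1e-5))   # True

# With ReLU in between, the output is in general no longer
# expressible as one linear map of the input
with_relu = lin2(torch.relu(lin1(xs_demo)))
print(with_relu.shape)   # torch.Size([5, 2])
```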

<img src="https://cdn-images-1.medium.com/max/1600/1*XxxiA0jJvPrHEJHD4z893g.png" width="640">

To use a feedforward neural network instead of linear regression, we can extend the `nn.Module` class from PyTorch.

In [0]:
class SimpleNet(nn.Module):
    # Initialize the layers
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(3, 3)
        self.act1 = nn.ReLU() # Activation function
        self.linear2 = nn.Linear(3, 2)
        self.reset()
    
    # Perform the computation
    def forward(self, x):
        x = self.linear1(x)
        x = self.act1(x)
        x = self.linear2(x)
        return x
    
    # Re-initialize all weights and biases from a standard normal distribution
    def reset(self):
        for m in self.children():
            if isinstance(m, nn.Linear):
                torch.nn.init.normal_(m.weight)
                torch.nn.init.normal_(m.bias)

Now we can define the model, optimizer and loss function exactly as before.

In [0]:
model = SimpleNet()
opt = torch.optim.SGD(model.parameters(), 1e-5)
loss_fn = F.mse_loss
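For a simple chain of layers like this one, `nn.Sequential` offers a more compact way to describe the same architecture. The sketch below is an alternative, not what the rest of this notebook uses, and it keeps PyTorch's default initialization instead of the normal initialization done in `reset`. It also counts the trainable parameters: 3 × 3 weights plus 3 biases in the first layer, and 3 × 2 weights plus 2 biases in the second.

```python
import torch
import torch.nn as nn

net_demo = nn.Sequential(
    nn.Linear(3, 3),
    nn.ReLU(),       # activation function
    nn.Linear(3, 2),
)

n_params = sum(p.numel() for p in net_demo.parameters())
print(n_params)   # 20 = (3*3 + 3) + (3*2 + 2)

out_demo = net_demo(torch.randn(5, 3))
print(out_demo.shape)   # torch.Size([5, 2])
```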

Finally, we can apply gradient descent to train the model using the same `fit` function defined earlier for linear regression.

<img src="https://i.imgur.com/g7Rl0r8.png" width="500">

In [135]:
model.reset()
fit(5000, model, loss_fn, opt)

Training loss:  tensor(10.1897, grad_fn=<MseLossBackward>)
