<a href="https://colab.research.google.com/github/depthtest/pytorch_2h/blob/master/dnn_pytorch_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# (_Deep_) Neural Networks

## What is a Neural Network

### Inspired by the biological Neuron

![Biological neuron](https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/Neuron.svg/1024px-Neuron.svg.png)

### A Neuron (Perceptron) from a Neural Network

![Perceptron](https://miro.medium.com/max/700/1*n6sJ4yZQzwKL9wnF5wnVNg.png)

### If we stack multiple perceptrons in layers

![multilayer perceptron](https://depthtest.github.io/content/170100_neural_rsc/neural_net2.jpeg)

If each layer of neurons is represented by a function $f_i$ that operates on vectors, then the multilayer combination is a composition of functions $y=f_3(f_2(f_1(\mathbf{x})))$, where $y$ is the output of the feedforward network and $\mathbf{x}$ is the input vector.

With multiple hidden layers, we now have a **Deep Neural Network**.
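
A minimal sketch of this composition in PyTorch. The layer sizes (4 inputs, two hidden layers of 4 neurons, 1 output) and the `tanh` non-linearity are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical layer sizes, chosen only for illustration
f1 = nn.Sequential(nn.Linear(4, 4), nn.Tanh())  # first hidden layer
f2 = nn.Sequential(nn.Linear(4, 4), nn.Tanh())  # second hidden layer
f3 = nn.Linear(4, 1)                            # output layer

x = torch.rand(4)      # input vector
y = f3(f2(f1(x)))      # explicit composition of the three layers

# nn.Sequential builds the same composition for us
net = nn.Sequential(f1, f2, f3)
print(torch.allclose(y, net(x)))
```

Both ways of writing it call the same three layers in the same order, so they should agree on the result.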

## So, NNs are function approximators
Through the composition of simple (non-linear) functions we can build arbitrarily complex functions.

**BUT**

How do we define those functions?

## Parameters
By changing the weights. 

The weights in the neurons are the parameters to be learned during training, because they determine the behaviour of the whole function defined by the network.


## Training (through Gradient Descent)
Ok, but how do we really train a neural network?

First we need a measure of how well or badly the network is doing. This is the loss function (more often than not it is based on a distance metric between the network output and the expected _true_ result, although it can be any metric you can think of).

Now picture this (an example with 2 parameters; the more parameters you have, the more dimensions you have to work with):
![Loss landscape](https://miro.medium.com/max/1400/0*KUao5zocCfeI_DuF.png)

For each choice of weights, running the network on some training (or test) data yields a loss value. Plotting these values gives the landscape of the loss function. Since the loss represents the network's error, we want to **minimize** it!

How to do that? Say hello to calculus.

The **derivative** measures the rate of change (slope) of the loss. The derivative (the gradient, when there are several weights) points in the direction in which the loss increases fastest, so to minimize the loss we have to _move_ the weights in the opposite direction.

That's what gradient descent is about, moving the weights in the negative direction of the derivative of the loss:
$$\mathbf{w_{k+1}}=\mathbf{w_k}-\lambda\nabla Loss(\mathbf{w_k})$$

We don't know how far to move, so we scale the step by a learning rate, here $\lambda$. The learning rate is a hyperparameter.
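
The update rule can be sketched in a few lines. The quadratic loss below, with its minimum at $(3, -1)$, is a made-up example for 2 parameters, and its gradient is written by hand.

```python
import torch

# Hypothetical loss with its minimum at w = (3, -1)
target = torch.tensor([3.0, -1.0])

def loss(w):
    return ((w - target) ** 2).sum()

def grad_loss(w):
    # Analytical gradient of the quadratic loss above
    return 2 * (w - target)

lr = 0.1              # learning rate (lambda in the formula)
w = torch.zeros(2)    # starting weights w_0

for k in range(100):
    w = w - lr * grad_loss(w)   # w_{k+1} = w_k - lambda * grad Loss(w_k)

print(w, loss(w))
```

With this learning rate each step shrinks the distance to the minimum by a constant factor, so `w` should end up very close to `target`. A learning rate that is too large would overshoot instead.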

## Backpropagation
That is fine and dandy, but how do we do it for each and every neuron?
After all, each neuron represents a whole (composite) function; do we have to compute each derivative by hand?

![Backpropagate](https://depthtest.github.io/content/170100_neural_rsc/neural_net_backprop_H.jpeg)

We apply the chain rule from calculus: compute the loss and _PROPAGATE_ it _BACK_ through the network until reaching the inputs, updating the weights on the way.

Additionally, doing it this way has two advantages:
1.   Each neuron knows its mathematical function, so it knows its derivative.
2.   The same computation may be reused multiple times, saving computation time in exchange for memory.

So, from the output layer:
$$\nabla Loss(\mathbf{w}_H) = \frac{\partial Loss}{\partial \mathbf{w}_{Hb_i, H}}$$


If we go to the second layer, the loss for each neuron:
$$\nabla Loss(\mathbf{w}_{Hb_0}) = \frac{\partial Loss}{\partial w_{Hb_0, H}} \frac{\partial w_{Hb_0, H}}{\partial \mathbf{w}_{Ha_i, Hb_0}}$$

$$\nabla Loss(\mathbf{w}_{Hb_1}) = \frac{\partial Loss}{\partial w_{Hb_1, H}} \frac{\partial w_{Hb_1, H}}{\partial \mathbf{w}_{Ha_i, Hb_1}}$$

$$\nabla Loss(\mathbf{w}_{Hb_2}) = \frac{\partial Loss}{\partial w_{Hb_2, H}} \frac{\partial w_{Hb_2, H}}{\partial \mathbf{w}_{Ha_i, Hb_2}}$$

$$\nabla Loss(\mathbf{w}_{Hb_3}) = \frac{\partial Loss}{\partial w_{Hb_3, H}} \frac{\partial w_{Hb_3, H}}{\partial \mathbf{w}_{Ha_i, Hb_3}}$$

Then, to the first layer (shown here only for the neuron $Ha_0$):
$$\begin{align}
\nabla Loss(\mathbf{w}_{Ha_0}) = 
\frac{\partial }{\partial \mathbf{w}_{X_i,Ha_0}} & \left(
\frac{\partial Loss}{\partial w_{Hb_0, H}} \frac{\partial w_{Hb_0, H}}{\partial w_{Ha_0, Hb_0}} + 
\frac{\partial Loss}{\partial w_{Hb_1, H}} \frac{\partial w_{Hb_1, H}}{\partial w_{Ha_0, Hb_1}} + \frac{\partial Loss}{\partial w_{Hb_2, H}} \frac{\partial w_{Hb_2, H}}{\partial w_{Ha_0, Hb_2}} + 
\frac{\partial Loss}{\partial w_{Hb_3, H}} \frac{\partial w_{Hb_3, H}}{\partial w_{Ha_0, Hb_3}} \right)
\end{align}$$


We could even check how much the loss depends on each of the inputs (shown here only for $X_0$):
$$\begin{align}
\nabla Loss(X_0) = \frac{\partial }{\partial X_0} & \Bigg( \\
\frac{\partial }{\partial \mathbf{w}_{X_i,Ha_0}} & \left(
\frac{\partial Loss}{\partial w_{Hb_0, H}} \frac{\partial w_{Hb_0, H}}{\partial w_{Ha_0, Hb_0}} + 
\frac{\partial Loss}{\partial w_{Hb_1, H}} \frac{\partial w_{Hb_1, H}}{\partial w_{Ha_0, Hb_1}} +
\frac{\partial Loss}{\partial w_{Hb_2, H}} \frac{\partial w_{Hb_2, H}}{\partial w_{Ha_0, Hb_2}} + 
\frac{\partial Loss}{\partial w_{Hb_3, H}} \frac{\partial w_{Hb_3, H}}{\partial w_{Ha_0, Hb_3}} \right) + \\
\frac{\partial }{\partial \mathbf{w}_{X_i,Ha_1}} & \left(
\frac{\partial Loss}{\partial w_{Hb_0, H}} \frac{\partial w_{Hb_0, H}}{\partial w_{Ha_1, Hb_0}} + 
\frac{\partial Loss}{\partial w_{Hb_1, H}} \frac{\partial w_{Hb_1, H}}{\partial w_{Ha_1, Hb_1}} +
\frac{\partial Loss}{\partial w_{Hb_2, H}} \frac{\partial w_{Hb_2, H}}{\partial w_{Ha_1, Hb_2}} + 
\frac{\partial Loss}{\partial w_{Hb_3, H}} \frac{\partial w_{Hb_3, H}}{\partial w_{Ha_1, Hb_3}} \right) + \\
\frac{\partial }{\partial \mathbf{w}_{X_i,Ha_2}} & \left(
\frac{\partial Loss}{\partial w_{Hb_0, H}} \frac{\partial w_{Hb_0, H}}{\partial w_{Ha_2, Hb_0}} + 
\frac{\partial Loss}{\partial w_{Hb_1, H}} \frac{\partial w_{Hb_1, H}}{\partial w_{Ha_2, Hb_1}} +
\frac{\partial Loss}{\partial w_{Hb_2, H}} \frac{\partial w_{Hb_2, H}}{\partial w_{Ha_2, Hb_2}} + 
\frac{\partial Loss}{\partial w_{Hb_3, H}} \frac{\partial w_{Hb_3, H}}{\partial w_{Ha_2, Hb_3}} \right) + \\
\frac{\partial }{\partial \mathbf{w}_{X_i,Ha_3}} & \left(
\frac{\partial Loss}{\partial w_{Hb_0, H}} \frac{\partial w_{Hb_0, H}}{\partial w_{Ha_3, Hb_0}} + 
\frac{\partial Loss}{\partial w_{Hb_1, H}} \frac{\partial w_{Hb_1, H}}{\partial w_{Ha_3, Hb_1}} +
\frac{\partial Loss}{\partial w_{Hb_2, H}} \frac{\partial w_{Hb_2, H}}{\partial w_{Ha_3, Hb_2}} + 
\frac{\partial Loss}{\partial w_{Hb_3, H}} \frac{\partial w_{Hb_3, H}}{\partial w_{Ha_3, Hb_3}} \right)
\Bigg)
\end{align}$$
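
As a sanity check on the chain-rule idea, here is a sketch on a much smaller network than the one in the figure: one input, one hidden `tanh` neuron and one output. All values are made up. The derivatives are written by hand and compared with what automatic differentiation computes.

```python
import torch

x, t = torch.tensor(0.5), torch.tensor(1.0)      # input and target
w1 = torch.tensor(0.8, requires_grad=True)
w2 = torch.tensor(-0.3, requires_grad=True)

# Forward pass
h = torch.tanh(w1 * x)
y = w2 * h
loss = (y - t) ** 2

# Backward pass done automatically
loss.backward()

# Same derivatives via the chain rule, written by hand
with torch.no_grad():
    dloss_dy = 2 * (y - t)                         # computed once, reused twice
    dloss_dw2 = dloss_dy * h                       # dy/dw2 = h
    dloss_dw1 = dloss_dy * w2 * (1 - h ** 2) * x   # dy/dh * dh/dw1

print(w1.grad, dloss_dw1)
print(w2.grad, dloss_dw2)
```

Note how `dloss_dy` is computed once and reused for both weights; that reuse is the second advantage listed earlier.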

# Thankfully all this is taken care of by libraries like PyTorch!

# PyTorch

Important: https://pytorch.org/docs/stable/

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as fun
import torch.optim as optim
import torchvision

## Using Tensors in PyTorch
(they are conceptually the _same_ as numpy arrays)

In [None]:
# numpy
import numpy as np
from numpy.linalg import inv

In [None]:
# numpy
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [None]:
# pytorch
torch.eye(3)

tensor([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]])

In [None]:
# numpy
X = np.random.random((5, 3))
X

array([[0.87449232, 0.28827181, 0.25432292],
       [0.50599129, 0.72203198, 0.77027947],
       [0.38719764, 0.38390987, 0.09011435],
       [0.1561072 , 0.21709713, 0.30539949],
       [0.3002811 , 0.33688603, 0.65911923]])

In [None]:
# pytorch
Y = torch.rand((5, 3))
Y

tensor([[0.4095, 0.7980, 0.3358],
        [0.5951, 0.9889, 0.4334],
        [0.5802, 0.4821, 0.3261],
        [0.0275, 0.8364, 0.2588],
        [0.7586, 0.1681, 0.4458]])

In [None]:
# numpy
X.T @ X

array([[1.28522421, 0.9011333 , 0.89264632],
       [0.9011333 , 0.91244097, 0.95242575],
       [0.89264632, 0.95242575, 1.19383822]])

In [None]:
# pytorch
Y.t() @ Y

tensor([[1.4346, 1.3454, 0.9299],
        [1.3454, 2.5749, 1.1452],
        [0.9299, 1.1452, 0.6726]])

In [None]:
# numpy
inv(X.T @ X)

array([[ 2.60599127, -3.2272788 ,  0.62614424],
       [-3.2272788 , 10.54932019, -6.00301238],
       [ 0.62614424, -6.00301238,  5.15857014]])

In [None]:
# pytorch
torch.inverse(Y.t() @ Y)

tensor([[ 17.0190,   6.4704, -34.5429],
        [  6.4704,   4.0593, -15.8557],
        [-34.5430, -15.8557,  76.2331]])

## More on Tensors

In [None]:
A = torch.eye(3)
A.add(1)

tensor([[2., 1., 1.],
        [1., 2., 1.],
        [1., 1., 2.]])

In [None]:
A[0, 0]

tensor(1.)

In [None]:
A[:, 1:3]

tensor([[0., 0.],
        [1., 0.],
        [0., 1.]])
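
Notice that `A[0, 0]` is still `1.` after `A.add(1)`: the call returned a new tensor and left `A` alone. A small sketch of the difference between out-of-place and in-place methods, using a separate tensor `M` (a name introduced here only for illustration):

```python
import torch

M = torch.eye(3)
N = M.add(1)    # out-of-place: returns a new tensor, M is unchanged
print(M[0, 0], N[0, 0])

M.add_(1)       # in-place: methods ending in "_" modify M itself
print(M[0, 0])
```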

In [None]:
# pytorch --> numpy
B = A.numpy()
B

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]], dtype=float32)

In [None]:
# numpy --> torch
torch.from_numpy(np.eye(3))

tensor([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]], dtype=torch.float64)
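
Two details are worth keeping in mind when converting. `torch.from_numpy` shares memory with the source array, and the default dtypes differ (note the `float64` above). A short sketch with made-up values:

```python
import numpy as np
import torch

arr = np.zeros(3)
t = torch.from_numpy(arr)   # shares memory with `arr` (CPU only)
arr[0] = 7.0
print(t)                    # the tensor sees the change

t_copy = torch.tensor(arr)  # torch.tensor() always copies the data
arr[1] = 9.0
print(t_copy)               # unaffected by the later change

# Default dtypes differ: NumPy uses float64, PyTorch float32
print(t.dtype, t.float().dtype)
```

This is why `.float()` is often called right after `torch.from_numpy(...)`.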

## Using the GPU
Here in Colab you will need a GPU-based environment. Go to `Runtime -> Change runtime type -> Hardware accelerator` and select GPU.

TPUs were originally tied to TensorFlow; PyTorch can target them through the separate PyTorch/XLA package.

In [None]:
# Checking if device is available and selecting it
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

device(type='cuda', index=0)

In [None]:
# Creating data and moving to device
data = torch.eye(3).to(device)
data

tensor([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]], device='cuda:0')

In [None]:
# Computation is now done on device
res = data + data
res

tensor([[2., 0., 0.],
        [0., 2., 0.],
        [0., 0., 2.]], device='cuda:0')
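
A sketch of two things that commonly trip people up: both operands of an operation must live on the same device, and a tensor has to be brought back to the CPU before converting it to NumPy. The code falls back to the CPU when no GPU is present.

```python
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

a = torch.eye(3, device=device)   # created directly on the device
b = torch.ones(3, 3)              # lives on the CPU by default

# Operands must be on the same device, so move `b` first
c = a + b.to(device)

# NumPy only understands CPU memory: bring the result back first
c_np = c.cpu().numpy()
print(c.device, c_np.shape)
```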

## Automatic differentiation with _autograd_

A tensor can record gradients directly if you tell it to do so, e.g. `torch.ones(3, requires_grad=True)`. Prior to `v0.4` this had to be done with the `Variable` class; `Variable` is no longer needed. Many tutorials still use `Variable`, so be aware!

Related: https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html

You rarely use `torch.autograd` directly. Pretty much everything is part of `torch.Tensor` now. Simply add `requires_grad=True` to the tensors you want to calculate gradients for. The parameters of an `nn.Module` track gradients automatically (modules are the lego pieces with which we build networks).

In [None]:
from torch import autograd

In [None]:
x = torch.tensor(2.)
x

tensor(2.)

In [None]:
x = torch.tensor(2., requires_grad=True)
x

tensor(2., requires_grad=True)

In [None]:
print(x.grad)

None


In [None]:
y = x ** 2

print("Grad of x:", x.grad)

Grad of x: None


In [None]:
y = x ** 2
y.backward()

print("Grad of x:", x.grad)

Grad of x: tensor(4.)
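
One behaviour that is easy to miss: `backward()` adds to `.grad` instead of overwriting it. The sketch below calls it twice on the same tensor and then resets the gradient by hand.

```python
import torch

x = torch.tensor(2., requires_grad=True)

(x ** 2).backward()
first = x.grad.clone()    # d(x^2)/dx = 2x = 4

(x ** 2).backward()
second = x.grad.clone()   # 4 + 4: the new gradient was added to the old one

x.grad.zero_()            # reset before the next backward pass
(x ** 2).backward()
third = x.grad.clone()

print(first, second, third)
```

This accumulation is the reason gradients have to be cleared between optimization steps.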


In [None]:
# Don't record the gradient
# Useful for inference

x = torch.tensor(2., requires_grad=True)

with torch.no_grad():
    y = x * x
    # y was computed without tracking, so it has no grad_fn
    print(y.grad_fn)

None


## Let's try a linear regression with Gradient Descent in PyTorch


### Getting the California Housing data

In [None]:
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
housing

Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to /root/scikit_learn_data


{'DESCR': '.. _california_housing_dataset:\n\nCalifornia Housing dataset\n--------------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 20640\n\n    :Number of Attributes: 8 numeric, predictive attributes and the target\n\n    :Attribute Information:\n        - MedInc        median income in block\n        - HouseAge      median house age in block\n        - AveRooms      average number of rooms\n        - AveBedrms     average number of bedrooms\n        - Population    block population\n        - AveOccup      average house occupancy\n        - Latitude      house block latitude\n        - Longitude     house block longitude\n\n    :Missing Attribute Values: None\n\nThis dataset was obtained from the StatLib repository.\nhttp://lib.stat.cmu.edu/datasets/\n\nThe target variable is the median house value for California districts.\n\nThis dataset was derived from the 1990 U.S. census, using one row per census\nblock group. A block group is the smallest geograp

In [None]:
n_samples, n_features = housing.data.shape

### Getting the device

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

### Creating the Tensors from the data

In [None]:
X = torch.from_numpy(housing.data).float().to(device)
y = torch.from_numpy(housing.target.reshape(-1,1)).float().to(device)

### Creating the Module to compute the Linear Regression

In [None]:
class LinReg(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.beta = nn.Linear(input_dim, 1, bias=True)
        
    def forward(self, X):
        return self.beta(X)
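
To see what the module holds, here is a sketch that lists its parameters and runs a forward pass on a random batch. The batch of 5 samples is made up; the 8 features match the dataset description above.

```python
import torch
import torch.nn as nn

class LinReg(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.beta = nn.Linear(input_dim, 1, bias=True)

    def forward(self, X):
        return self.beta(X)

model = LinReg(8)   # 8 features, as in the California Housing data
for name, p in model.named_parameters():
    print(name, tuple(p.shape), p.requires_grad)

out = model(torch.rand(5, 8))   # batch of 5 hypothetical samples
print(out.shape)
```

The weight and bias of the inner `nn.Linear` are registered automatically and already have `requires_grad=True`.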

### We create the Model, Criterion and Optimizer

In [None]:
model = LinReg(n_features).to(device)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

### A train step

In [None]:
# Train step
model.train()  # <-- here
optimizer.zero_grad()

y_ = model(X)
loss = loss_fn(y_, y)

loss.backward()
optimizer.step()

loss.item()

28897.984375

Take into account two things:
1.   We have not divided the dataset into train and test, which is a must.
2.   This is only one step of the optimization; we should create a loop.
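
Both points can be sketched as follows. To keep the example self-contained it uses synthetic data instead of the housing set, a plain `nn.Linear` in place of `LinReg`, an 80/20 split and 200 full-batch epochs. All of these are assumptions, not choices made in the cells above. Standardizing the features is an extra step too; unscaled features tend to make plain SGD with `lr=0.1` unstable.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Synthetic stand-in for the housing data (hypothetical: 1000 samples, 8 features)
X = torch.randn(1000, 8)
true_w = torch.randn(8, 1)
y = X @ true_w + 0.1 * torch.randn(1000, 1)

# Random 80/20 train/test split
perm = torch.randperm(X.shape[0])
train_idx, test_idx = perm[:800], perm[800:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Standardize with statistics from the training set only
mean, std = X_train.mean(dim=0), X_train.std(dim=0)
X_train, X_test = (X_train - mean) / std, (X_test - mean) / std

model = nn.Linear(8, 1)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

for epoch in range(200):
    model.train()
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = loss_fn(model(X_train), y_train)
    loss.backward()
    optimizer.step()

# Evaluate on data the model has never seen
model.eval()
with torch.no_grad():
    test_loss = loss_fn(model(X_test), y_test)

print(loss.item(), test_loss.item())
```

For larger datasets the loop would usually iterate over mini-batches rather than the full training set at every step.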

### Running inference

In [None]:
# Eval
model.eval()  # <-- here
with torch.no_grad():
    y_ = model(X)

## Other nice things

### Debugging

You can use the Python Debugger (here the IPython Debugger)

In [None]:
from IPython.core.debugger import set_trace

In [None]:
def my_function(x):
    answer = 42
    set_trace()
    answer += x
    return answer

my_function(12)

> <ipython-input-35-0515935c5419>(4)my_function()
      2     answer = 42
      3     set_trace()
----> 4     answer += x
      5     return answer
      6 

ipdb> c


54

#### Let's revisit the Linear Regression Module

In [None]:
class LinRegBreak(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.beta = nn.Linear(input_dim, 1, bias=True)
        
    def forward(self, X):
        set_trace()
        return self.beta(X)

In [None]:
model = LinRegBreak(n_features).to(device)

# For testing purposes, execute just in inference mode
model.eval()
with torch.no_grad():
    y_ = model(X)
y_

> <ipython-input-36-04466ee5b0cf>(8)forward()
      4         self.beta = nn.Linear(input_dim, 1, bias=True)
      5 
      6     def forward(self, X):
      7         set_trace()
----> 8         return self.beta(X)

ipdb> c


tensor([[ -33.6637],
        [-331.8139],
        [ -61.0560],
        ...,
        [-130.8249],
        [ -92.3194],
        [-185.5877]], device='cuda:0')

### Saving and restoring models

#### nn.Module.state_dict()

In [None]:
model = LinReg(n_features)
model.state_dict()

OrderedDict([('beta.weight',
              tensor([[-0.0040,  0.1278, -0.1822,  0.2654,  0.2707, -0.2566,  0.0804, -0.2591]])),
             ('beta.bias', tensor([-0.3007]))])

#### optim.Optimizer.state_dict()

In [None]:
optimizer = optim.SGD(model.parameters(), lr=0.1)
optimizer.state_dict()

{'param_groups': [{'dampening': 0,
   'lr': 0.1,
   'momentum': 0,
   'nesterov': False,
   'params': [0, 1],
   'weight_decay': 0}],
 'state': {}}

#### Storing and loading `state_dict`

In [None]:
model_file = "model_state_dict.pt"
torch.save(model.state_dict(), model_file)

In [None]:
model = LinReg(n_features)
model.load_state_dict(torch.load(model_file))

<All keys matched successfully>

#### Storing and loading the full model

In [None]:
model_file = "model_123.pt"
torch.save(model, model_file)

In [None]:
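# Only works if code for `LinReg` module is available right now.
# Recent PyTorch releases default to `weights_only=True` in `torch.load`;
# there, pass `weights_only=False` (trusted files only) to unpickle a full model.
model = torch.load(model_file)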
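
A related situation: a `state_dict` saved from a GPU model and loaded on a machine without one. The sketch below uses the `map_location` argument of `torch.load` to remap the stored tensors to the CPU. The temporary file path and the plain `nn.Linear` model are assumptions for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = nn.Linear(8, 1).to(device)

# Hypothetical location for the saved file
model_file = os.path.join(tempfile.mkdtemp(), "model_state_dict.pt")
torch.save(model.state_dict(), model_file)

# map_location remaps the stored tensors onto the CPU,
# whatever device they were saved from
state = torch.load(model_file, map_location="cpu")

restored = nn.Linear(8, 1)
restored.load_state_dict(state)
restored.eval()
```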

#### Sample Checkpointing
You can store model, optimizer and arbitrary information and reload it.

Example:
```python
torch.save(
    {
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'epoch': epoch,
        'loss': loss,
    },
    PATH,
)
```
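
Restoring from such a checkpoint could look like the sketch below. The model, the optimizer settings, the `epoch` and `loss` values and the file path are all made up for illustration. The model and optimizer have to be rebuilt first, and then their state is loaded.

```python
import os
import tempfile

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(8, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Hypothetical bookkeeping values, for illustration only
epoch, loss = 5, 0.42
PATH = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")

torch.save(
    {
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'epoch': epoch,
        'loss': loss,
    },
    PATH,
)

# Later: rebuild the objects, then restore their state
new_model = nn.Linear(8, 1)
new_optimizer = optim.SGD(new_model.parameters(), lr=0.1, momentum=0.9)

checkpoint = torch.load(PATH)
new_model.load_state_dict(checkpoint['model_state_dict'])
new_optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1   # resume from the next epoch

new_model.train()   # or new_model.eval() if only running inference
```

Storing the loss as a plain Python number (e.g. `loss.item()`) rather than a tensor keeps the checkpoint independent of any device.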