# Neural Networks


## How Neural Networks Learn

Neural networks use supervised learning to fine-tune their parameters (weights and biases) by minimizing the loss function. This process involves repeating these four steps during training:

1. **Forward Pass:** Process input data through the network to generate predictions.
2. **Loss Function:** Measure the error by comparing predictions with actual target values.
3. **Backward Pass:** Calculate gradients of the loss function with respect to the network's parameters using backpropagation.
4. **Update Parameters:** Adjust weights and biases to reduce the loss, guided by the computed gradients.
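
The four steps above can be sketched on a toy problem. The snippet below is a minimal illustration and not the chapter's model: the data (`y = 2x + 1`), the one-neuron linear model, the mean squared error loss, and the learning rate are all assumptions chosen to keep the example small, and the gradients come from PyTorch's autograd.

```python
import torch

# toy data (hypothetical): targets follow y = 2x + 1
x = torch.linspace(-1, 1, 20).unsqueeze(1)
y = 2 * x + 1

# parameters of a one-neuron linear model
w = torch.zeros(1, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.1  # learning rate (assumed value)

losses = []
for step in range(200):
    pred = x @ w + b                   # 1. forward pass
    loss = ((pred - y) ** 2).mean()    # 2. loss (mean squared error)
    w.grad, b.grad = None, None        # clear gradients from the previous step
    loss.backward()                    # 3. backward pass (autograd)
    with torch.no_grad():              # 4. update parameters
        w -= lr * w.grad
        b -= lr * b.grad
    losses.append(loss.item())

print(f"w = {w.item():.3f}, b = {b.item():.3f}")
```

After enough iterations, `w` and `b` should approach the values used to generate the data.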


## Digits Recognizer

&ensp; &ensp; &ensp; &ensp; In this chapter we will build a discriminative model to recognize handwritten digits. To train our neural network, we will use the [MNIST dataset](https://huggingface.co/datasets/ylecun/mnist), which has 60,000 training images and 10,000 test images. All the images are grayscale and 28 x 28 pixels.

```{note}
This chapter will focus on key concepts and challenges of deep neural networks. Chapter 4 will delve more into working with images and image classification techniques.
```

```{figure} ../images/digit-recognizer.png
---
width: 410px
name: digit-recognizer
---
Digit Recognizer
```

In [1]:
import numpy as np
import torch

# convert PIL image to normalized PyTorch tensor
def image_to_tensor(image):
    return torch.tensor(np.array(image)) / 255.0


def preprocess_data(split):
    
    x = []  # list to store image tensors
    y = []  # list to store labels

    for example in split:
        x.append(image_to_tensor(example['image']))
        y.append(example['label'])
    
    return torch.stack(x), torch.tensor(y)

In [2]:
# import libraries
import torch
from datasets import load_dataset

# set print options for PyTorch tensors
torch.set_printoptions(linewidth=140, sci_mode=False, precision=4)

# load the MNIST dataset
ds = load_dataset("ylecun/mnist")

# preprocess the data (we will see this function in Chapter 4)
train_x, train_y = preprocess_data(ds['train'])

print(f"Images: {train_x.shape}")
print(f"Labels: {train_y.shape}")

Images: torch.Size([60000, 28, 28])
Labels: torch.Size([60000])


In [3]:
# transform input tensor from (60000, 28, 28) to (60000, 784)
X = train_x.view(-1, 784)
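
To see what the reshape does, here is a small sketch on a made-up batch of two 28 x 28 "images" (the tensor `batch` is hypothetical and stands in for `train_x`). Each image is flattened row by row, and `-1` asks PyTorch to infer the batch dimension.

```python
import torch

# hypothetical batch of 2 images whose pixel values are just a running counter
batch = torch.arange(2 * 28 * 28, dtype=torch.float32).view(2, 28, 28)

# flatten each image into a vector of 784 values; -1 infers the batch size
flat = batch.view(-1, 784)

print(flat.shape)
# pixel (row 1, column 0) of image 0 lands at position 28 of the flat vector
print(flat[0, 28] == batch[0, 1, 0])
```

`view` does not copy the data: `flat` and `batch` share the same underlying storage.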

## MLP

&ensp; &ensp; &ensp; &ensp; A **multilayer perceptron (MLP)** is a type of neural network composed of multiple layers of fully connected neurons with nonlinear activation functions. Our MLP will have 784 input neurons (one for each pixel in the image) and 10 output neurons (one for each possible class: 0 to 9).

```{figure} ../images/MLP.png
---
width: 450px
name: MLP
---
MLP
```

```{admonition} Help
:class: dropdown
The `torch.randn` function generates a tensor filled with random numbers drawn from a standard normal distribution (mean = 0, standard deviation = 1). For more information, please refer to the [PyTorch documentation](https://pytorch.org/docs/stable/generated/torch.randn.html).
```
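
As a quick sanity check of the description above, the sketch below draws a weight matrix with `torch.randn` and inspects its statistics. The shape `(784, 100)` and the seed are illustrative choices.

```python
import torch

# a seeded generator makes the random draw reproducible
g = torch.Generator().manual_seed(1)
w = torch.randn((784, 100), generator=g)

# sample mean should be close to 0 and sample standard deviation close to 1
print(f"mean: {w.mean().item():.4f}   std: {w.std().item():.4f}")

# a second generator with the same seed reproduces the same values
g2 = torch.Generator().manual_seed(1)
w_again = torch.randn((784, 100), generator=g2)
print(torch.equal(w, w_again))
```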

In [4]:
def initialize_nn(n_hidden = 100):

    g = torch.Generator().manual_seed(1)
    
    W1 = torch.randn((784, n_hidden),      generator=g)
    b1 = torch.zeros(n_hidden)
    W2 = torch.randn((n_hidden, n_hidden), generator=g)
    b2 = torch.zeros(n_hidden)
    W3 = torch.randn((n_hidden, 10),       generator=g)
    b3 = torch.zeros(10)

    parameters = [W1, b1, W2, b2, W3, b3]

    for p in parameters:
        p.requires_grad = True

    return parameters

parameters = initialize_nn()
W1, b1, W2, b2, W3, b3 = parameters

## Forward Pass

&ensp; &ensp; &ensp; &ensp; In the **forward pass**, the input data flows through the neural network, layer by layer, to produce the network's output, known as logits.

```{admonition} Neuron's output (dot product)
<p class="bottom-margin">In section <a href="#1.3"><i>1.3. Neurons</i></a>, we saw that the output of a neuron is given by the formula:</p>

$$
\small
h
=
\sigma\left(
\begin{bmatrix}
x_{1} & x_{2} & \dots & x_{d}
\end{bmatrix}
\cdot
\begin{bmatrix}
w_{1} \\
w_{2} \\
\vdots \\
w_{d}
\end{bmatrix}
+
b
\right)
$$

<p class="no-top-margin">where:</p>

- $d$: Dimensionality of the input vector.
```



````{admonition} Layer's output (single example)
<p class="bottom-margin">We can adjust the previous formula and use a weight matrix to obtain the output of a layer, that is, the activation values of all the neurons in the layer:</p>

$$
\small
\begin{bmatrix}
h_{1} & h_{2} & \dots & h_{m}
\end{bmatrix}
=
\sigma\left(
\begin{bmatrix}
x_{1} & x_{2} & \dots & x_{d}
\end{bmatrix}
\cdot
\begin{bmatrix}
w_{11} & w_{12} & \dots & w_{1m} \\
w_{21} & w_{22} & \dots & w_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
w_{d1} & w_{d2} & \dots & w_{dm} \\
\end{bmatrix}
+
\begin{bmatrix}
b_{1} & b_{2} & \dots & b_{m}
\end{bmatrix}
\right)
$$

<p class="no-top-margin">where:</p>

- $d$: Dimensionality of the input vector.
- $m$: Number of neurons in the layer.

```{important}
Each column of the weight matrix contains the weights of the connections between a single neuron in the layer and all the neurons in the previous layer.
```

````
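
The following sketch checks the layer formula numerically on small, made-up sizes (`d = 4`, `m = 3`), assuming `tanh` as the activation function. It computes the layer output once with the weight matrix and once neuron by neuron, where neuron `j` uses column `j` of the weight matrix.

```python
import torch

g = torch.Generator().manual_seed(0)
d, m = 4, 3                           # illustrative sizes
x = torch.randn(d, generator=g)       # one input example
W = torch.randn((d, m), generator=g)  # one column of weights per neuron
b = torch.randn(m, generator=g)       # one bias per neuron

# whole layer at once
h = torch.tanh(x @ W + b)

# one neuron at a time: dot product of x with column j, plus bias j
h_per_neuron = torch.stack(
    [torch.tanh(torch.dot(x, W[:, j]) + b[j]) for j in range(m)]
)

print(torch.allclose(h, h_per_neuron, atol=1e-6))
```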


````{admonition} Layer's output (multiple examples)
<p class="bottom-margin">Matrix multiplication enables us to efficiently calculate in parallel the output of a layer for several input examples:</p>

$$
\tiny
\begin{bmatrix}
h_{11} & h_{12} & \dots & h_{1m} \\
h_{21} & h_{22} & \dots & h_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
h_{N1} & h_{N2} & \dots & h_{Nm}
\end{bmatrix}
=
\sigma\left(
\begin{bmatrix}
x_{11} & x_{12} & \dots & x_{1d} \\
x_{21} & x_{22} & \dots & x_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N1} & x_{N2} & \dots & x_{Nd}
\end{bmatrix}
\times
\begin{bmatrix}
w_{11} & w_{12} & \dots & w_{1m} \\
w_{21} & w_{22} & \dots & w_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
w_{d1} & w_{d2} & \dots & w_{dm} \\
\end{bmatrix}
+
\begin{bmatrix}
b_{1} & b_{2} & \dots & b_{m}
\end{bmatrix}
\right)
$$

<p class="no-top-margin">where:</p>

- $d$: Dimensionality of the input vector.
- $m$: Number of neurons in the layer.
- $N$: Number of examples in a batch.

```{important}
Each row of the output matrix contains the activation values of all the neurons in the layer for a single input example.
```

````
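
A small sketch of the batched formula, again with made-up sizes (`N = 5`, `d = 4`, `m = 3`) and `tanh` assumed as the activation. The bias vector is broadcast across the rows, and each row of the result matches what we would get by processing that example on its own.

```python
import torch

g = torch.Generator().manual_seed(0)
N, d, m = 5, 4, 3                     # illustrative sizes
X = torch.randn((N, d), generator=g)  # one example per row
W = torch.randn((d, m), generator=g)
b = torch.randn(m, generator=g)

# all examples at once; b (shape m) is broadcast to every row of X @ W
H = torch.tanh(X @ W + b)

# the same computation, one example at a time
H_loop = torch.stack([torch.tanh(X[i] @ W + b) for i in range(N)])

print(H.shape)
print(torch.allclose(H, H_loop, atol=1e-6))
```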

## Activation functions

### Tanh


### Softmax

&ensp; &ensp; &ensp; &ensp; **Softmax** is an activation function often used in the output layer of neural networks. It transforms raw neural network outputs, known as logits, into probability distributions where each probability represents the model's confidence that a given example belongs to a specific class.

```{figure} ../images/softmax.png
---
width: 500px
name: softmax
---
Softmax
```
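
The sketch below applies softmax to two made-up rows of logits. It also shows a caveat that the text above does not cover: exponentiating large logits directly overflows in floating point. Subtracting each row's maximum before exponentiating is a common remedy that leaves the probabilities unchanged; it is included here as an illustration, and `torch.softmax` is used as a reference.

```python
import torch

logits = torch.tensor([[2.0, 1.0, 0.1],
                       [1000.0, 1001.0, 1002.0]])  # second row is deliberately large

# direct formula: exp(1000) overflows to inf, so the second row becomes nan
naive = logits.exp() / logits.exp().sum(1, keepdim=True)

# shifted formula: subtract the row maximum first (same result mathematically)
shifted = logits - logits.max(1, keepdim=True).values
stable = shifted.exp() / shifted.exp().sum(1, keepdim=True)

reference = torch.softmax(logits, dim=1)

print(naive)
print(stable)
print(stable.sum(1))  # each row sums to 1
```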

In [11]:
# hidden layer 1
h1_pre = X @ W1 + b1
h1 = torch.tanh(h1_pre)

# hidden layer 2
h2_pre = h1 @ W2 + b2
h2 = torch.tanh(h2_pre)

# output layer
logits = h2 @ W3 + b3
probs = logits.exp() / logits.exp().sum(1, keepdims=True)  # softmax

## Calculate Loss

&ensp; &ensp; &ensp; &ensp; In Chapter 2, we defined the forward pass as the step where input data flows through the neural network, layer by layer, to produce the network's output. Now, we are going to see the forward pass as the step where the model, using its current parameters, generates predictions. 


&ensp; &ensp; &ensp; &ensp; **Cross-Entropy Loss** is a widely used loss function for classification tasks that evaluates how well these predictions align with the labels. To calculate the loss, we often use PyTorch's function `F.cross_entropy`, which efficiently applies Softmax and calculates the average NLL.


### Average NLL

```{admonition} Likelihood

The **likelihood** is the product of all the probabilities assigned by the model to the correct classes:

$$
\text{likelihood} = \prod_{i=1}^{N} p_i
$$

where:  
- $p_i$ is the probability assigned by the model to the correct class for the $i$-th example.  
- $N$ is the total number of examples in the dataset.
```

```{admonition} Log Likelihood

Since each $p_i$ is a value between 0 and 1, their product can become very small. To avoid numerical issues, we use the **log likelihood**:

$$
\text{log likelihood} = \log \left(\prod_{i=1}^{N} p_i \right)
$$

Using the product property of logarithms, this can be rewritten as:

$$
\text{log likelihood} = \sum_{i=1}^{N} \log(p_i)
$$
```
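
The numerical issue is easy to reproduce. In this minimal sketch, 2,000 hypothetical examples each receive a probability of 0.5 for the correct class: the product underflows to exactly zero in double precision, while the sum of logarithms stays well within range.

```python
import math

# hypothetical probabilities assigned to the correct class
probs = [0.5] * 2000

# likelihood: product of probabilities (underflows to 0.0)
likelihood = 1.0
for p in probs:
    likelihood *= p

# log likelihood: sum of log probabilities (no underflow)
log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)
print(f"{log_likelihood:.2f}")
```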


````{admonition} Negative Log Likelihood

Looking at the graph of the logarithmic function, please note that:

- If we pass in a probability of $1$, the log probability is $0$.
- If we pass in a lower probability $\left(0 < p < 1 \right)$, the log probability becomes more negative.  
- If we pass in a probability of 0, the log probability is $-\infty$.

```{figure} ../images/log-function.png
---
width: 250px
name: log-function
---
Logarithmic Function
```

Thus, when all the individual probabilities are 1 (the best-case scenario), the log likelihood is 0, and when the probabilities decrease, the log likelihood becomes more negative. To make it easier to interpret, we use the **negative log likelihood (NLL)**, a positive metric where values closer to 0 indicate better predictions:

$$
\text{NLL} = - \sum_{i=1}^{N} \log(p_i)
$$

````

```{admonition} Average Negative Log Likelihood

To normalize the NLL, it is often divided by the total number of examples in the dataset, resulting in the **average negative log likelihood**:

$$
\text{Average NLL} = - \frac{1}{N} \sum_{i=1}^{N} \log(p_i)
$$

```

Thus, the average NLL is a value that is closer to 0 when the probabilities assigned by the model to the correct classes are closer to 1 for all the examples.
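
As a check on the definition, the sketch below computes the average NLL by hand and compares it with `F.cross_entropy`, which takes the raw logits and applies softmax internally. The logits and labels are random, made-up values for six examples and ten classes.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
logits = torch.randn((6, 10), generator=g)     # made-up logits
targets = torch.tensor([3, 1, 4, 1, 5, 9])     # made-up labels

# manual route: softmax, pick the probability of the correct class, average the NLL
probs = torch.softmax(logits, dim=1)
p_correct = probs[torch.arange(6), targets]
manual = -p_correct.log().mean()

# built-in route: works directly on the logits
builtin = F.cross_entropy(logits, targets)

print(torch.allclose(manual, builtin, atol=1e-5))
```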

```{admonition} Help
:class: dropdown
`torch.view()`
```

In [12]:
# probability assigned by the model to the correct class for the ith example
pi = probs[(torch.arange(logits.shape[0]), train_y)]

# average negative log likelihood
loss = -pi.log().mean()

print(f"Train Loss: {loss:.4f}")

Train Loss: 15.9699


## Backward Pass

In [39]:
N = 60000  # number of training examples

In [40]:
# hidden layer 1
h1_pre = X @ W1 + b1
h1 = torch.tanh(h1_pre)

# hidden layer 2
h2_pre = h1 @ W2 + b2
h2 = torch.tanh(h2_pre)

# output layer
logits = h2 @ W3 + b3

# softmax
counts = logits.exp()
counts_sum = counts.sum(1, keepdims=True)
counts_sum_inv = counts_sum**-1     # same as 1.0 / counts_sum
probs = counts * counts_sum_inv

# average negative log likelihood
logprobs = probs.log()
loss = -logprobs[range(N), train_y].mean()
loss

tensor(15.9699, grad_fn=<NegBackward0>)

In [41]:
for p in parameters:
  p.grad = None

for t in [logprobs, probs, counts, counts_sum, counts_sum_inv,
          h1, h1_pre, h2, h2_pre, logits]:
  t.retain_grad()

loss.backward()

The loss is the negative mean of the log probabilities assigned to the correct classes:

`loss = -logprobs[range(N), train_y].mean()`

<br>

- Let us simplify the problem to *loss = -(a + b + c) / 3*. 

- Algebraically, *loss = -(1/3)a - (1/3)b - (1/3)c*. 

- Then, the derivative of the loss with respect to a (or b or c) is *dloss/da = -1/3*. 

- More generally, for N terms, *dloss/da = -1/N*.

`dlogprobs` will hold the derivative of the loss with respect to all the elements of `logprobs`.

Thus, the derivative of the loss with respect to the **log probabilities of the correct classes** is -1/N.

The **log probabilities of the incorrect classes** do not participate in the calculation of the loss. Thus, the derivative of the loss with respect to them is zero.

In [42]:
dlogprobs = torch.zeros_like(logprobs) # matrix of zeros with the shape of logprobs
dlogprobs[range(N), train_y] = -1.0 / N
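
The rule for `dlogprobs` can be checked against autograd on a tiny example. Everything here is a made-up stand-in (4 examples, 3 classes, random log probabilities): the manual gradient is `-1/N` at the correct-class positions and zero elsewhere, and it should match `logprobs.grad`.

```python
import torch

g = torch.Generator().manual_seed(0)
N, C = 4, 3                                     # illustrative sizes
targets = torch.tensor([2, 0, 1, 2])            # made-up labels

# made-up log probabilities, marked as a leaf tensor that tracks gradients
logprobs = torch.log_softmax(torch.randn((N, C), generator=g), dim=1)
logprobs.requires_grad_()

loss = -logprobs[range(N), targets].mean()
loss.backward()                                 # autograd fills logprobs.grad

# manual gradient: -1/N for the correct classes, 0 for the rest
dlogprobs = torch.zeros_like(logprobs)
dlogprobs[range(N), targets] = -1.0 / N

print(torch.allclose(dlogprobs, logprobs.grad))
```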

$$
logprobs = log(probs) \quad \Rightarrow \quad dprobs = \frac{1}{probs} \cdot dlogprobs
$$

In [43]:
dprobs = (1.0 / probs) * dlogprobs

$$
probs = counts \cdot counts\_sum\_inv \quad \Rightarrow \quad
\begin{matrix}
dcounts = counts\_sum\_inv \cdot dprobs \\
dcounts\_sum\_inv = \text{sum over each row of } (counts \cdot dprobs)
\end{matrix}
$$

In [44]:
dcounts = counts_sum_inv * dprobs
dcounts_sum_inv = (counts * dprobs).sum(1, keepdim=True)

$$
counts\_sum\_inv = \frac{1}{counts\_sum} \quad \Rightarrow \quad dcounts\_sum = -\frac{1}{counts\_sum^{2}} \cdot dcounts\_sum\_inv
$$

In [45]:
dcounts_sum = (-counts_sum**-2) * dcounts_sum_inv

$$
counts\_sum = \text{sum of each row of counts} \quad \Rightarrow \quad dcounts \mathrel{+}= dcounts\_sum \text{ (broadcast across the row)}
$$

In [46]:
dcounts += torch.ones_like(counts) * dcounts_sum
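
The backward rule for a row-wise sum can be verified in isolation. In this sketch (made-up sizes and random values), every element of a row contributes to that row's sum with a factor of 1, so the upstream gradient of each sum is simply copied to all the elements of its row.

```python
import torch

g = torch.Generator().manual_seed(0)
counts = torch.rand((3, 4), generator=g, requires_grad=True)  # made-up values
counts_sum = counts.sum(1, keepdim=True)                      # shape (3, 1)

# stand-in for the gradient arriving from later steps, one value per row
upstream = torch.randn((3, 1), generator=g)
counts_sum.backward(upstream)

# manual rule: broadcast each row's upstream gradient across the row
dcounts = torch.ones_like(counts) * upstream

print(torch.allclose(dcounts, counts.grad))
```

In the chapter's code the update uses `+=` because `counts` also feeds into `probs` directly, so the two contributions are added together.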

$$
counts = e^{logits} \quad \Rightarrow \quad dlogits = counts \cdot dcounts
$$

In [47]:
dlogits = counts * dcounts
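
Chaining the steps from `dlogprobs` to `dlogits` one at a time is instructive, but the result also has a well-known closed form that the text above does not state: for softmax followed by average NLL, the gradient with respect to the logits is `(probs - one_hot(targets)) / N`. The sketch below checks that expression against autograd on made-up data.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
N, C = 4, 3                                         # illustrative sizes
logits = torch.randn((N, C), generator=g, requires_grad=True)
targets = torch.tensor([2, 0, 1, 2])                # made-up labels

# autograd reference
loss = F.cross_entropy(logits, targets)
loss.backward()

# closed form: subtract 1 from the probability of the correct class, divide by N
probs = torch.softmax(logits.detach(), dim=1)
dlogits = (probs - F.one_hot(targets, num_classes=C)) / N

print(torch.allclose(dlogits, logits.grad, atol=1e-6))
```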

$$
\text{logits} = h2 \times W3 + b3 \quad \Rightarrow \quad
\begin{matrix}
dh2 = dlogits \times (W3)^T \\
dW3 = (h2)^T \times dlogits \\
db3 = \text{sum of columns of dlogits}
\end{matrix}
$$

In [48]:
dh2 = dlogits @ W3.T
dW3 = h2.T @ dlogits
db3 = dlogits.sum(0)
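
The three rules for a linear layer can be checked on a small stand-in layer. The sizes and the `upstream` tensor (which plays the role of `dlogits`) are made up; autograd provides the reference gradients.

```python
import torch

g = torch.Generator().manual_seed(0)
h = torch.randn((5, 4), generator=g, requires_grad=True)  # layer input
W = torch.randn((4, 3), generator=g, requires_grad=True)  # weights
b = torch.zeros(3, requires_grad=True)                    # biases

out = h @ W + b
upstream = torch.randn((5, 3), generator=g)  # stand-in for dlogits
out.backward(upstream)                       # autograd reference

with torch.no_grad():
    dh = upstream @ W.T        # gradient with respect to the input
    dW = h.T @ upstream        # gradient with respect to the weights
    db = upstream.sum(0)       # bias is shared by all examples, so sum over them

print(torch.allclose(dh, h.grad, atol=1e-6),
      torch.allclose(dW, W.grad, atol=1e-6),
      torch.allclose(db, b.grad, atol=1e-6))
```

A quick way to remember the rules is to match shapes: each gradient must have the same shape as the tensor it belongs to.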

$$
h2 = tanh(h2\_pre) \quad \Rightarrow \quad dh2\_pre = (1 - (h2)^2) \cdot dh2
$$

In [49]:
dh2_pre = (1.0 - h2**2) * dh2
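
The local derivative used here, `1 - tanh(x)^2`, can be checked numerically. This sketch compares it with a central finite difference at a few arbitrary points, using double precision to keep rounding error small.

```python
import torch

x = torch.linspace(-2, 2, 9, dtype=torch.float64)  # arbitrary test points
h = torch.tanh(x)

# analytic derivative, written in terms of the output h
analytic = 1.0 - h**2

# numerical derivative: central finite difference
eps = 1e-6
numeric = (torch.tanh(x + eps) - torch.tanh(x - eps)) / (2 * eps)

print(torch.allclose(analytic, numeric, atol=1e-8))
```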

$$
h2\_pre = h1 \times W2 + b2 \quad \Rightarrow \quad
\begin{matrix}
dh1 = dh2\_pre \times (W2)^T \\
dW2 = (h1)^T \times dh2\_pre \\
db2 = \text{sum of columns of dh2\_pre}
\end{matrix}
$$

In [50]:
dh1 = dh2_pre @ W2.T
dW2 = h1.T @ dh2_pre
db2 = dh2_pre.sum(0)

$$
h1 = tanh(h1\_pre) \quad \Rightarrow \quad dh1\_pre = (1 - (h1)^2) \cdot dh1
$$

In [51]:
dh1_pre = (1.0 - h1**2) * dh1

$$
h1\_pre = X \times W1 + b1 \quad \Rightarrow \quad
\begin{matrix}
dX = dh1\_pre \times (W1)^T \\
dW1 = (X)^T \times dh1\_pre \\
db1 = \text{sum of columns of dh1\_pre}
\end{matrix}
$$

In [52]:
dX = dh1_pre @ W1.T
dW1 = X.T @ dh1_pre
db1 = dh1_pre.sum(0)
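
Since `loss.backward()` was called earlier, each hand-derived gradient can be compared with the one autograd stored in `.grad`. The helper `compare` below is a hypothetical utility, not something defined in the chapter, and it is demonstrated on a tiny made-up linear classifier so that the snippet runs on its own.

```python
import torch

def compare(name, manual, tensor):
    """Report whether a hand-derived gradient matches tensor.grad from autograd."""
    exact = torch.all(manual == tensor.grad).item()
    approx = torch.allclose(manual, tensor.grad, atol=1e-6)
    maxdiff = (manual - tensor.grad).abs().max().item()
    print(f"{name:6s} | exact: {exact!s:5s} | approximate: {approx!s:5s} | max diff: {maxdiff:.2e}")
    return approx

# tiny stand-in model: 8 examples, 5 features, 3 classes (all made up)
g = torch.Generator().manual_seed(0)
X = torch.randn((8, 5), generator=g)
y = torch.randint(0, 3, (8,), generator=g)
W = torch.randn((5, 3), generator=g, requires_grad=True)
b = torch.zeros(3, requires_grad=True)

logits = X @ W + b
logprobs = torch.log_softmax(logits, dim=1)
loss = -logprobs[range(8), y].mean()
loss.backward()                      # autograd reference

with torch.no_grad():                # manual backward pass
    dlogits = torch.softmax(logits, dim=1)
    dlogits[range(8), y] -= 1.0
    dlogits /= 8
    dW = X.T @ dlogits
    db = dlogits.sum(0)

ok_W = compare("W", dW, W)
ok_b = compare("b", db, b)
```

With the chapter's variables, the calls would look like `compare("W1", dW1, W1)`, `compare("b1", db1, b1)`, and so on. Small differences are expected because of floating-point rounding, which is why the helper reports an approximate match alongside the exact one.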

In [61]:
epochs = 100       # train iterations
lr = 0.1           # learning rate

# initialize neural network
parameters = initialize_nn()
W1, b1, W2, b2, W3, b3 = parameters

for epoch in range(epochs):

    # -------------------- forward pass --------------------

    # hidden layer 1
    h1_pre = X @ W1 + b1
    h1 = torch.tanh(h1_pre)

    # hidden layer 2
    h2_pre = h1 @ W2 + b2
    h2 = torch.tanh(h2_pre)

    # output layer
    logits = h2 @ W3 + b3

    # softmax
    counts = logits.exp()
    counts_sum = counts.sum(1, keepdims=True)
    counts_sum_inv = counts_sum**-1
    probs = counts * counts_sum_inv


    # -------------------- calculate loss --------------------

    # average negative log likelihood
    logprobs = probs.log()
    loss = -logprobs[range(N), train_y].mean()

    # print loss every 10 epochs
    if epoch % 10 == 0 or epoch == epochs - 1:
        print(f"Epoch: {epoch:2d}/{epochs}     Loss: {loss.item():.4f}")
    

    # -------------------- backward pass --------------------

    dlogprobs = torch.zeros_like(logprobs)
    dlogprobs[range(N), train_y] = -1.0 / N

    dprobs = (1.0 / probs) * dlogprobs
    dcounts = counts_sum_inv * dprobs
    dcounts_sum_inv = (counts * dprobs).sum(1, keepdim=True)
    dcounts_sum = (-counts_sum**-2) * dcounts_sum_inv
    dcounts += torch.ones_like(counts) * dcounts_sum

    dlogits = counts * dcounts
    dh2 = dlogits @ W3.T
    dW3 = h2.T @ dlogits
    db3 = dlogits.sum(0)
    dh2_pre = (1.0 - h2**2) * dh2
    dh1 = dh2_pre @ W2.T
    dW2 = h1.T @ dh2_pre
    db2 = dh2_pre.sum(0)
    dh1_pre = (1.0 - h1**2) * dh1
    dX = dh1_pre @ W1.T
    dW1 = X.T @ dh1_pre
    db1 = dh1_pre.sum(0)

    grads = [dW1, db1, dW2, db2, dW3, db3]


    # -------------------- update parameters --------------------

    for p, grad in zip(parameters, grads):
        p.data += -lr * grad

Epoch:  0/100     Loss: 15.9699
Epoch: 10/100     Loss: 10.9262
Epoch: 20/100     Loss: 8.3451
Epoch: 30/100     Loss: 6.7141
Epoch: 40/100     Loss: 5.6266
Epoch: 50/100     Loss: 4.8790
Epoch: 60/100     Loss: 4.3457
Epoch: 70/100     Loss: 3.9500
Epoch: 80/100     Loss: 3.6406
Epoch: 90/100     Loss: 3.3900
Epoch: 99/100     Loss: 3.2024
