# Neural Networks


## Deep Learning

&ensp; &ensp; &ensp; &ensp; **Artificial Intelligence (AI)** is a branch of computer science focused on creating systems that perform tasks that require human-like intelligence, such as language comprehension, pattern recognition, problem-solving, and decision-making.

**Machine Learning (ML)** is a subset of AI that involves training models on data to identify patterns and make predictions. ML models learn from data and improve their accuracy over time using one of these approaches:

- **Supervised Learning**: The model is trained on labeled data and learns to associate inputs with outputs, making it well-suited for classification and regression tasks.
- **Unsupervised Learning**: The model is trained on unlabeled data and learns to identify hidden patterns or groupings within the data, making it well-suited for clustering and association tasks.
- **Reinforcement Learning**: The model learns through rewards and penalties based on its actions, making it well-suited for environments where decision-making is complex, such as strategic games and robotics.

**Deep Learning (DL)** is a specialized area within ML that uses neural networks to recognize complex patterns in large datasets.

```{figure} ../images/deep-learning.png
---
width: 140px
name: deep-learning
---
Deep Learning Overview
```




## Neural Networks

&ensp; &ensp; &ensp; &ensp; **Neural networks** are computational models inspired by the structure of the human brain, designed to recognize patterns and make predictions. They consist of layers of interconnected nodes (often called neurons) that process information through mathematical operations.

A basic neural network has the following structure: 
- **Input Layer**: The first layer receives raw data, like images, text, or numerical values. Each node in this layer represents an input feature.
- **Hidden Layers**: The intermediate layers sit between the input and output layers and process the information. Each hidden layer transforms the data from the previous layer, allowing the network to progressively learn and recognize patterns.
- **Output Layer**: The final layer provides the network’s output, such as classifying an image or predicting a value.

```{figure} ../images/neural-network.png
---
width: 340px
name: neural-network
---
Basic Structure of a Neural Network
```



(1.3)=
## Neurons

&ensp; &ensp; &ensp; &ensp; In a neural network, each **neuron** is a fundamental unit that takes in multiple inputs and processes them to produce a single output. As shown in [*Fig. 2 Basic Structure of a Neural Network*](neural-network), each neuron in the hidden and output layers connects to all the neurons in the previous layer. These connections have associated values, known as **weights**, which are adjusted by the model.

&ensp; &ensp; &ensp; &ensp; During the forward pass, the neuron's inputs ($x_{n}$) are multiplied by their corresponding connection weights ($w_{n}$), and the results are summed. An additional value, the **bias** ($b$), is often added to the weighted sum. The result is typically passed through an **activation function** ($\sigma$), which introduces non-linearity and, depending on the function chosen, constrains the neuron's output to a fixed range.

```{figure} ../images/neuron.png
---
width: 250px
name: neuron
---
Structure of a Neuron
```

<p style="margin-top: 0;">The output of the above neuron is:</p>

$$
\text{output} = \sigma(x_{1}w_{1} + x_{2}w_{2} + x_{3}w_{3} + b)
$$

More generally, the output of a neuron is:

$$
\text{output} = \sigma(x_{1}w_{1} + x_{2}w_{2} + x_{3}w_{3} + \dots + x_{n}w_{n} + b)
$$

<p style="margin-top: 0;">where $n$ is the number of inputs to the neuron.</p>

Using summation notation, this can be written more compactly as:

$$
\text{output} = \sigma\left(\sum_{i=1}^{n} x_{i}w_{i} + b\right)
$$

<p style="margin-top: 0;">where $n$ is the number of inputs to the neuron and $i$ indexes those inputs.</p>

```{important}
The inputs to a neuron are the outputs of the neurons in the previous layer, and the output of a neuron is one of the inputs to the neurons in the next layer.
```

```{note}
The weights represent the strength of connection between the neurons. A larger weight indicates a stronger connection.
```
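
As a minimal sketch of the neuron formula above, the snippet below computes the output of one neuron with three inputs. The input values, weights, and bias are hypothetical, and the sigmoid is assumed as the activation function purely for illustration.

```python
import torch

# hypothetical values for a single neuron with three inputs
inputs = torch.tensor([1.0, 2.0, 3.0])
weights = torch.tensor([0.5, -1.0, 0.25])
bias = torch.tensor(0.1)

# weighted sum: x1*w1 + x2*w2 + x3*w3 + b
weighted_sum = (inputs * weights).sum() + bias

# activation function (sigmoid assumed here) applied to the weighted sum
output = torch.sigmoid(weighted_sum)

print(f"Weighted sum: {weighted_sum.item():.4f}")
print(f"Output: {output.item():.4f}")
```

Because the sigmoid maps any real number into the interval (0, 1), the output stays in that range regardless of how large the weighted sum becomes.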




## PyTorch

&ensp; &ensp; &ensp; &ensp; We will use **PyTorch** to build our neural networks. PyTorch is an open-source machine learning library widely used for building and training deep learning models due to its flexibility, ease of use, and efficient computation. The following code sets up the environment for working with PyTorch:

In [2]:
# import PyTorch
import torch

# set print options to avoid scientific notation
torch.set_printoptions(sci_mode=False)

&ensp; &ensp; &ensp; &ensp; PyTorch provides multi-dimensional arrays, known as **tensors**, which are similar to NumPy arrays but optimized for GPU processing. Depending on their dimensions, we will refer to the tensors differently:

A 1-dimensional tensor is called a **vector**. A vector with shape (3,) would look like this:

```python
tensor([1, 2, 3])
```

<p style="margin-top: 0;">A 2-dimensional tensor is called a <strong>matrix</strong>. A matrix with shape (2, 3) would look like this:</p>

```python
tensor([[1, 2, 3],
        [4, 5, 6]])
```

<p style="margin-top: 0;">A 3 or more dimensional tensor is called an <strong>n-dimensional tensor</strong>. A 3-dimensional tensor with shape (2, 2, 3) would look like this:</p>

```python
tensor([[[1, 2, 3],
         [4, 5, 6]],
        [[7, 8, 9],
         [10, 11, 12]]])
```
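
The tensors above can be created and inspected with a short sketch like the following. The variable names are illustrative; `ndim` and `shape` are standard tensor attributes.

```python
import torch

vector = torch.tensor([1, 2, 3])
matrix = torch.tensor([[1, 2, 3],
                       [4, 5, 6]])
tensor_3d = torch.tensor([[[1, 2, 3],
                           [4, 5, 6]],
                          [[7, 8, 9],
                           [10, 11, 12]]])

# print the number of dimensions and the shape of each tensor
for name, t in [("vector", vector), ("matrix", matrix), ("tensor_3d", tensor_3d)]:
    print(f"{name}: dimensions = {t.ndim}, shape = {tuple(t.shape)}")
```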




## How Neural Networks Learn

Neural networks use supervised learning to fine-tune their parameters. Specifically, they rely on **gradient descent** to adjust their weights and biases during training. This optimization algorithm involves four key steps:

1. **Forward Pass:** Process input data through the network to generate predictions.
2. **Loss Function:** Measure the error by comparing predictions with actual target values.
3. **Backward Pass:** Calculate gradients of the loss function with respect to the network's parameters using backpropagation.
4. **Update Parameters:** Adjust weights and biases to reduce the loss, guided by the computed gradients.
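
The four steps above can be sketched on a deliberately tiny problem: a model with a single weight `w` that should learn the relationship `y = 2x`. The data, the initial weight, the learning rate, and the number of steps are all hypothetical choices made for illustration.

```python
import torch

# hypothetical data following y = 2x
inputs = torch.tensor([1.0, 2.0, 3.0])
targets = torch.tensor([2.0, 4.0, 6.0])

# a single trainable weight, initialized at 0
w = torch.tensor(0.0, requires_grad=True)
lr = 0.05  # learning rate

for step in range(50):
    # 1. forward pass
    predictions = w * inputs

    # 2. loss function (mean squared error)
    loss = ((predictions - targets) ** 2).mean()

    # 3. backward pass
    w.grad = None
    loss.backward()

    # 4. update parameters (no gradient tracking during the update)
    with torch.no_grad():
        w -= lr * w.grad

print(f"Learned weight: {w.item():.4f}")
```

With these settings the weight should settle very close to 2, the value that makes the loss zero.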




## Example

&ensp; &ensp; &ensp; &ensp; To better understand gradient descent and how neural networks learn, we will create a simple neural network that converts 3-digit binary numbers into their decimal equivalents. The table below shows the equivalences:

| Binary | Decimal |
|--------|---------|
| 000    | 0       |
| 001    | 1       |
| 010    | 2       |
| 011    | 3       |
| 100    | 4       |
| 101    | 5       |
| 110    | 6       |
| 111    | 7       |

The conversion from a binary number $b_2b_1b_0$ to its decimal equivalent $D$ follows the formula:

$$
D = b_2 \cdot 2^2 + b_1 \cdot 2^1 + b_0 \cdot 2^0
$$

where $b_2, b_1, b_0$ are the binary digits 0 or 1.
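
For reference, the formula can be evaluated directly. The sketch below applies it to a few hand-picked rows; the tensor names are illustrative, and this is the exact rule rather than anything a network has learned.

```python
import torch

# a few 3-digit binary numbers, one per row, ordered as (b2, b1, b0)
bits = torch.tensor([[0, 0, 0],
                     [0, 1, 1],
                     [1, 0, 1],
                     [1, 1, 1]])

# place values 2^2, 2^1, 2^0
place_values = torch.tensor([4, 2, 1])

# D = b2*4 + b1*2 + b0*1, computed for every row
decimals = (bits * place_values).sum(dim=1)
print(decimals)
```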

Our neural network won't know this formula. Instead, it will learn to approximate this relationship through training. We will train the model with the following input examples and corresponding targets:

In [48]:
x = torch.tensor([[0, 0, 0], [0, 0, 0], [0, 0, 1], [0, 0, 1], [0, 1, 0], [0, 1, 0], [0, 1, 1], [0, 1, 1], 
                  [1, 0, 0], [1, 0, 0], [1, 0, 1], [1, 0, 1], [1, 1, 0], [1, 1, 0], [1, 1, 1], [1, 1, 1]]).float()

targets = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7])

print(f"Input:\n{x}\n")
print(f"Targets:\n{targets}")

Input:
tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 1.],
        [0., 0., 1.],
        [0., 1., 0.],
        [0., 1., 0.],
        [0., 1., 1.],
        [0., 1., 1.],
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 1.],
        [1., 0., 1.],
        [1., 1., 0.],
        [1., 1., 0.],
        [1., 1., 1.],
        [1., 1., 1.]])

Targets:
tensor([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7])


### Network Structure

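&ensp; &ensp; &ensp; &ensp; The neural network will have the following structure: an input layer with 3 nodes (one per binary digit), a hidden layer with 20 neurons, and an output layer with 8 neurons.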





To begin, we will initialize the network parameters randomly. For simplicity, we will exclude bias terms and activation functions at this stage.

```{note}
:class: dropdown
The `torch.randn` function generates a tensor filled with random numbers drawn from a standard normal distribution (mean = 0, standard deviation = 1). For more information, please refer to the [PyTorch documentation](https://pytorch.org/docs/stable/generated/torch.randn.html).
```

```{note}
:class: dropdown
The `requires_grad=True` argument indicates that the tensor should track gradients for the operations applied to it. This allows us to call `loss.backward()` in the backward pass.
```

In [42]:
# seed the random number generator for reproducibility
g = torch.Generator().manual_seed(1)

# randomly initialize the weight matrices
W1 = torch.randn((3, 20), generator=g)   # (input_to_layer, output_from_layer)
W2 = torch.randn((20, 8), generator=g)   # (input_to_layer, output_from_layer)

# list of parameters
parameters = [W1, W2]

# track gradients
for p in parameters:
    p.requires_grad = True

# print total number of parameters
print(f"Number of parameters: {sum(p.nelement() for p in parameters)}")

Number of parameters: 220



## Forward Pass

&ensp; &ensp; &ensp; &ensp; In the **forward pass**, the input data flows through the network, layer by layer. In section [1.3. Neurons](1.3), we showed the formula for the output of a neuron, which can be written in vector form as:

$$
\small
h
=
\sigma\left(
\begin{bmatrix}
x_{1} & x_{2} & x_{3} & \dots & x_{d}
\end{bmatrix}
\times
\begin{bmatrix}
w_{1} \\
w_{2} \\
w_{3} \\
\vdots \\
w_{d}
\end{bmatrix}
+
b
\right)
$$

<p style="margin-top: 0;">where:</p>

- $d$: Dimensionality of the input vector.

<br>

The output of a layer is:

$$
\small
\begin{bmatrix}
h_{1} & h_{2} & h_{3} & \dots & h_{m}
\end{bmatrix}
=
\sigma\left(
\begin{bmatrix}
x_{1} & x_{2} & x_{3} & \dots & x_{d}
\end{bmatrix}
\cdot
\begin{bmatrix}
w_{11} & w_{12} & w_{13} & \dots & w_{1m} \\
w_{21} & w_{22} & w_{23} & \dots & w_{2m} \\
w_{31} & w_{32} & w_{33} & \dots & w_{3m} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
w_{d1} & w_{d2} & w_{d3} & \dots & w_{dm} \\
\end{bmatrix}
+
\begin{bmatrix}
b_{1} & b_{2} & b_{3} & \dots & b_{m}
\end{bmatrix}
\right)
$$

<p style="margin-top: 0;">where:</p>

- $d$: Dimensionality of the input vector.
- $m$: Number of neurons in the layer.

<br>

**Matrix multiplication** enables us to efficiently calculate in parallel the output of a layer for several input examples:

$$
\small
\begin{bmatrix}
h_{11} & h_{12} & h_{13} & \dots & h_{1m} \\
h_{21} & h_{22} & h_{23} & \dots & h_{2m} \\
h_{31} & h_{32} & h_{33} & \dots & h_{3m} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
h_{N1} & h_{N2} & h_{N3} & \dots & h_{Nm}
\end{bmatrix}
=
\sigma\left(
\begin{bmatrix}
x_{11} & x_{12} & x_{13} & \dots & x_{1d} \\
x_{21} & x_{22} & x_{23} & \dots & x_{2d} \\
x_{31} & x_{32} & x_{33} & \dots & x_{3d} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{N1} & x_{N2} & x_{N3} & \dots & x_{Nd}
\end{bmatrix}
\times
\begin{bmatrix}
w_{11} & w_{12} & w_{13} & \dots & w_{1m} \\
w_{21} & w_{22} & w_{23} & \dots & w_{2m} \\
w_{31} & w_{32} & w_{33} & \dots & w_{3m} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
w_{d1} & w_{d2} & w_{d3} & \dots & w_{dm} \\
\end{bmatrix}
+
\begin{bmatrix}
b_{1} & b_{2} & b_{3} & \dots & b_{m}
\end{bmatrix}
\right)
$$

<p style="margin-top: 0;">where:</p>

- $d$: Dimensionality of the input vector.
- $m$: Number of neurons in the layer.
- $N$: Number of examples in a batch.
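
As a hedged sketch of the batched formula above, the snippet below computes a layer's output for several examples at once and checks that it matches processing each example separately. The sizes `N`, `d`, and `m`, the random values, and the sigmoid activation are assumptions made for illustration.

```python
import torch

g = torch.Generator().manual_seed(0)
N, d, m = 4, 3, 5   # examples in the batch, input features, neurons in the layer

X = torch.randn((N, d), generator=g)   # one example per row
W = torch.randn((d, m), generator=g)   # one column of weights per neuron
b = torch.randn(m, generator=g)        # one bias per neuron

# whole batch at once: the bias vector is broadcast across the N rows
H = torch.sigmoid(X @ W + b)

# same computation, one example at a time
H_rows = torch.stack([torch.sigmoid(X[i] @ W + b) for i in range(N)])

print(H.shape)
print(torch.allclose(H, H_rows, atol=1e-6))
```

Each row of `H` holds the outputs of the `m` neurons for one example, which is why the result has shape `(N, m)`.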






The outputs of the 20 neurons in the first layer for the 16 examples would be:

$$
\begin{bmatrix}
h_{1,1} & h_{1,2} & h_{1,3} & \dots & h_{1,20} \\
h_{2,1} & h_{2,2} & h_{2,3} & \dots & h_{2,20} \\
h_{3,1} & h_{3,2} & h_{3,3} & \dots & h_{3,20} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
h_{16,1} & h_{16,2} & h_{16,3} & \dots & h_{16,20}
\end{bmatrix}
=
\begin{bmatrix}
x_{1,1} & x_{1,2} & x_{1,3} \\
x_{2,1} & x_{2,2} & x_{2,3} \\
x_{3,1} & x_{3,2} & x_{3,3} \\
\vdots & \vdots & \vdots \\
x_{16,1} & x_{16,2} & x_{16,3}
\end{bmatrix}
\times
\begin{bmatrix}
w_{1,1} & w_{1,2} & w_{1,3} & \dots & w_{1,20} \\
w_{2,1} & w_{2,2} & w_{2,3} & \dots & w_{2,20} \\
w_{3,1} & w_{3,2} & w_{3,3} & \dots & w_{3,20} \\
\end{bmatrix}
$$

$\text{where } h_{1,1} = x_{1,1}w_{1,1} + x_{1,2}w_{2,1} + x_{1,3}w_{3,1}$

In [43]:
# matrix multiplication
h1 = x @ W1     # (16,20) = (16,3) x (3,20)
h2 = h1 @ W2    # (16,8) = (16,20) x (20,8)

## Loss Function

&ensp; &ensp; &ensp; &ensp; A **loss function** is a mathematical representation that quantifies how well a machine learning model is performing. It measures the difference between the model's predicted outputs and the actual outputs from the dataset (the **targets**).

There are various types of loss functions, each suitable for different tasks:
- Regression Tasks: Mean Squared Error (MSE) and Mean Absolute Error (MAE) (both covered in section [](1.8)).
- Classification Tasks: Cross-Entropy Loss (covered in [](2.7)) and Hinge Loss.


The network's output is compared to the actual values, and a loss function is used to measure the prediction error. 




### Mean Squared Error

The **Mean Squared Error (MSE)** is a commonly used loss function for regression tasks, where the objective is to predict continuous values. MSE computes the average of the squared differences between the predicted values ($\hat{y}_i$) and the actual values ($y_i$). Squaring the errors penalizes large discrepancies disproportionately, which makes MSE more sensitive to outliers than the **Mean Absolute Error (MAE)**, which weights each error in proportion to its magnitude.

The formulas for MSE and MAE are as follows:

$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

$$
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
$$
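
To make the difference between the two formulas concrete, the sketch below evaluates both on a small set of hypothetical values in which the last prediction is far off. The built-in functions `F.mse_loss` and `F.l1_loss` are used only to cross-check the manual computation.

```python
import torch
import torch.nn.functional as F

# hypothetical targets and predictions; the last prediction is an outlier
targets_demo = torch.tensor([1.0, 2.0, 3.0, 4.0])
predictions_demo = torch.tensor([1.5, 2.0, 2.5, 8.0])

# manual implementations of the formulas
mse = ((targets_demo - predictions_demo) ** 2).mean()
mae = (targets_demo - predictions_demo).abs().mean()

# PyTorch built-ins for comparison
mse_builtin = F.mse_loss(predictions_demo, targets_demo)
mae_builtin = F.l1_loss(predictions_demo, targets_demo)

print(f"MSE: {mse.item():.4f} (built-in: {mse_builtin.item():.4f})")
print(f"MAE: {mae.item():.4f} (built-in: {mae_builtin.item():.4f})")
```

The single error of 4 contributes 16 to the sum of squares but only 4 to the sum of absolute values, so it dominates the MSE far more than the MAE.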

We will first define the following targets, which are the values to which the neural network's output should converge.

In [7]:
print(f"Targets:\n{targets}\n")

Targets:
tensor([[ 1.1000, -1.3000],
        [ 2.3000,  2.0000],
        [-2.5000, -3.2000],
        [ 0.2000,  4.3000],
        [ 0.6000,  5.3000]])



We will then calculate the loss using the Mean Squared Error.

```{note}
The `tensor.item()` method returns the value of a single-element tensor. For more information, please refer to the [PyTorch documentation](https://pytorch.org/docs/stable/generated/torch.Tensor.item.html).
```

In [8]:
# calculate loss using MSE
loss = ((h2 - targets) ** 2).mean()
print(f"Loss: {loss.item():.4f}")

Loss: 0.0183


## Backpropagation

The network calculates gradients by propagating the error backward through the layers, determining how much each parameter contributed to the error. During **backpropagation**, the neural network computes the **gradient of the loss function** with respect to all its parameters by applying the chain rule of calculus.

To compute all the gradients, we will call `loss.backward()`, which relies on the operations that PyTorch tracked during the forward pass. In [](ch3.ipynb) we will perform a backward pass manually to better understand how backpropagation works and how the gradients are calculated.
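
Before applying this to the network, the chain rule can be checked on a one-parameter example. The sketch below compares the gradient produced by `loss.backward()` with a hand-derived value; the numbers and variable names are hypothetical.

```python
import torch

# hypothetical single weight, input, and target
w = torch.tensor(3.0, requires_grad=True)
x_val = 2.0
target = 1.0

# forward pass and squared-error loss
prediction = w * x_val
loss = (prediction - target) ** 2

# backward pass: autograd fills in w.grad
loss.backward()

# chain rule by hand: dL/dw = dL/dprediction * dprediction/dw
manual_grad = 2 * (prediction.item() - target) * x_val

print(f"Autograd gradient: {w.grad.item():.4f}")
print(f"Manual gradient:   {manual_grad:.4f}")
```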

In [9]:
# set the gradients to None
for p in parameters:
    p.grad = None

# calculate gradients
loss.backward()

## Update Parameters

Once the gradients have been computed, an optimization algorithm uses them to progressively **update the parameters** (weights and biases) so that the loss decreases. The most common optimization techniques for updating the parameters are **gradient descent**, **stochastic gradient descent (SGD)**, and **Adam**.

A **gradient** is a vector that represents the rate of change of a function with respect to its input variables. By default, gradients point in the **direction of steepest ascent**, that is, the direction in which the function increases the fastest. Since we want to minimize the loss function, we will move in the opposite direction of the gradients. 



### Gradient Descent

**Gradient descent** is an optimization algorithm used to minimize a function. In neural network training, it is frequently used to minimize the loss function.

Gradient descent updates each parameter by subtracting a fraction of the gradient from its current value, scaled by a factor known as the **learning rate**. The learning rate determines the size of the steps we take towards the minimum. A smaller learning rate results in more precise but slower convergence, while a larger learning rate can speed up convergence but may risk overshooting the minimum. The parameter update rule for gradient descent can be expressed mathematically as:

$$
\theta = \theta - \eta \nabla L(\theta)
$$

Where:
- $\theta$ represents the parameters of the neural network
- $\eta$ is the learning rate
- $\nabla L(\theta)$ is the gradient of the loss function with respect to the parameters.
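
The effect of the learning rate can be seen on a toy loss, $L(\theta) = \theta^2$, whose minimum is at $\theta = 0$. The sketch below applies the update rule with two hypothetical learning rates; the helper `gradient_descent` and all values are illustrative assumptions.

```python
import torch

def gradient_descent(lr, steps=10):
    """Minimize L(theta) = theta^2 starting from theta = 1 (illustrative helper)."""
    theta = torch.tensor(1.0, requires_grad=True)
    for _ in range(steps):
        loss = theta ** 2
        theta.grad = None
        loss.backward()
        # update rule: theta = theta - lr * gradient
        with torch.no_grad():
            theta -= lr * theta.grad
    return theta.item()

small_lr_result = gradient_descent(lr=0.1)
large_lr_result = gradient_descent(lr=1.1)

print(f"theta after 10 steps with lr=0.1: {small_lr_result:.4f}")
print(f"theta after 10 steps with lr=1.1: {large_lr_result:.4f}")
```

With the smaller learning rate, each step shrinks $\theta$ toward 0. With the larger one, each step overshoots the minimum and lands farther away on the opposite side, so the value grows instead of converging.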

In [10]:
lr = 0.1 # learning rate

# clone for later comparison 
old_W1 = W1.data.clone()
old_W2 = W2.data.clone()

# update parameters using gradient descent
for p in parameters:
    p.data += -lr * p.grad

print(f"W1 before GD:\n{old_W1}")
print(f"W1 gradients:\n{W1.grad}")
print(f"W1 after GD:\n{W1.data}\n")

print(f"W2 before GD:\n{old_W2}")
print(f"W2 gradients:\n{W2.grad}")
print(f"W2 after GD2:\n{W2.data}")

W1 before GD:
tensor([[-1.5256, -0.7502,  0.6995,  0.1991,  0.8657,  0.2444],
        [-0.6629,  0.8073,  0.4391,  1.1712, -2.2456, -1.4465],
        [ 0.0612, -0.6177, -0.7981, -0.1316, -0.7984,  0.3357]])
W1 gradients:
tensor([[ 0.0536, -0.0771,  0.2550,  0.1593,  0.0444,  0.2188],
        [-0.0216,  0.0368, -0.0499, -0.0180, -0.0137, -0.0411],
        [ 0.0185, -0.0187,  0.1620,  0.1197,  0.0213,  0.1415]])
W1 after GD:
tensor([[-1.5310, -0.7425,  0.6740,  0.1832,  0.8613,  0.2225],
        [-0.6608,  0.8036,  0.4441,  1.1730, -2.2442, -1.4423],
        [ 0.0593, -0.6159, -0.8143, -0.1436, -0.8006,  0.3216]])

W2 before GD:
tensor([[ 0.3935,  1.1322],
        [-0.5404, -2.2102],
        [ 2.1130, -0.0040],
        [ 1.3800, -1.3505],
        [ 0.3455,  0.5046],
        [ 1.8213, -0.1814]])
W2 gradients:
tensor([[-0.1637, -0.0016],
        [-0.1570, -0.0064],
        [ 0.0129,  0.0072],
        [-0.0138, -0.0103],
        [ 0.0964,  0.0373],
        [ 0.0894,  0.0136]])
W2 after GD2:


```{note}
The sign of each gradient indicates whether the corresponding parameter should be decreased (positive gradient) or increased (negative gradient), and its magnitude, scaled by the learning rate, determines the size of the adjustment.
```


## Neural Network Training

During **training**, a neural network iteratively makes a forward pass, calculates the loss, makes a backward pass, and updates its parameters. Each iteration is usually called a **training step**.

In [11]:
# randomly initialize the weight matrices
g = torch.Generator().manual_seed(1)
W1 = torch.randn((3, 6), generator=g)   # (input_to_layer, output_from_layer)
W2 = torch.randn((6, 2), generator=g)   # (input_to_layer, output_from_layer)


# parameters
parameters = [W1, W2]

for p in parameters:
    p.requires_grad = True

print(f"Number of parameters: {sum(p.nelement() for p in parameters)}")

Number of parameters: 30


In [12]:
max_steps = 15      # train iterations
lr = 0.1            # learning rate

# list to keep track of the loss in each step
track_loss = []

for step in range(max_steps):

    # forward pass
    h1 = x @ W1     # (5,6) = (5,3) x (3,6)
    h2 = h1 @ W2    # (5,2) = (5,6) x (6,2)

    # keep a copy of the first output for later comparison
    if step == 0:
        old_h2 = h2.detach().clone()

    # calculate loss
    loss = ((h2 - targets) ** 2).mean()
    track_loss.append(loss.item())
    print(f"Step: {step:2d}/{max_steps}     Loss: {loss.item():.4f}")

    # backward pass
    for p in parameters:
        p.grad = None
    loss.backward()

    # update parameters
    for p in parameters:
        p.data += -lr * p.grad

Step:  0/15     Loss: 0.0183
Step:  1/15     Loss: 0.0062
Step:  2/15     Loss: 0.0054
Step:  3/15     Loss: 0.0051
Step:  4/15     Loss: 0.0049
Step:  5/15     Loss: 0.0047
Step:  6/15     Loss: 0.0045
Step:  7/15     Loss: 0.0044
Step:  8/15     Loss: 0.0043
Step:  9/15     Loss: 0.0042
Step: 10/15     Loss: 0.0041
Step: 11/15     Loss: 0.0040
Step: 12/15     Loss: 0.0040
Step: 13/15     Loss: 0.0039
Step: 14/15     Loss: 0.0039


In [13]:
import matplotlib.pyplot as plt

# plot the loss recorded at each training step
plt.plot(track_loss)
plt.xlabel('Step')
plt.ylabel('Loss');

As we can see, through gradient descent, the neural network gradually reduced its loss and improved its predictions over time. Comparing the initial output to the final output, we can observe that the final predictions are closer to the target values.

In [None]:
print(f"First output:\n{old_h2}")
print(f"Last output:\n{h2}")
print(f"Targets:\n{targets}")

First output:
tensor([[ 1.0329, -1.2264],
        [ 2.3319,  2.0953],
        [-2.8955, -3.1665],
        [ 0.1669,  4.3554],
        [ 0.5632,  5.2940]])
Last output:
tensor([[ 0.9563, -1.2406],
        [ 2.2978,  2.0103],
        [-2.5604, -3.1827],
        [ 0.2316,  4.3254],
        [ 0.5060,  5.2981]], grad_fn=<MmBackward0>)
Targets:
tensor([[ 1.1000, -1.3000],
        [ 2.3000,  2.0000],
        [-2.5000, -3.2000],
        [ 0.2000,  4.3000],
        [ 0.6000,  5.3000]])
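
The manual parameter update used in this chapter can also be delegated to one of PyTorch's built-in optimizers. As a hedged sketch, the snippet below checks that a single step of `torch.optim.SGD` (with its default settings, that is, no momentum) produces the same result as the manual rule. The data, shapes, and variable names are hypothetical and unrelated to the example above.

```python
import torch

g = torch.Generator().manual_seed(0)

# hypothetical data and a single weight matrix
inputs = torch.randn((4, 3), generator=g)
targets_demo = torch.randn((4, 2), generator=g)
W_manual = torch.randn((3, 2), generator=g, requires_grad=True)
W_optim = W_manual.detach().clone().requires_grad_(True)  # identical starting point
lr = 0.1

# manual gradient descent step
loss = ((inputs @ W_manual - targets_demo) ** 2).mean()
loss.backward()
with torch.no_grad():
    W_manual -= lr * W_manual.grad

# the same step using the built-in optimizer
optimizer = torch.optim.SGD([W_optim], lr=lr)
optimizer.zero_grad()
loss = ((inputs @ W_optim - targets_demo) ** 2).mean()
loss.backward()
optimizer.step()

print(torch.allclose(W_manual, W_optim))
```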


<br>