# Feedforward Neural Networks

**Deep feedforward networks** (or **feedforward neural networks**, or **multilayer perceptrons**, MLPs) are fundamental models in deep learning. Their goal is to approximate a target function $f^*$.

**Example:**

For a classifier, $y = f^*(x)$ maps input $x$ to a category $y$. A feedforward network defines $y = f(x;\theta)$ and learns parameters $\theta$ to best approximate $f^*$.

These models are called **feedforward** because information flows in one direction: there are no **feedback** connections. Networks with feedback are **recurrent neural networks**. Convolutional networks for image recognition are a specialized type of feedforward network. Feedforward networks are represented as a composition of functions over a **directed acyclic graph (DAG)**.

**Example:**

Given three layers $f^{(1)}, f^{(2)}, f^{(3)}$, the network computes:

$$
f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))
$$

This is a network of **depth 3**. The last layer is called the **output layer**, while intermediate layers are **hidden layers**. Each hidden layer is typically vector-valued, with dimensionality defining the **width** of the network.
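
As a minimal sketch of this composition, the snippet below chains three layer functions into one network. The names `compose`, `layer1`, `layer2`, and `layer3`, and the arithmetic inside each layer, are illustrative assumptions rather than anything fixed by the text.

```python
from functools import reduce

def compose(*layers):
    """Return a function that applies the given layers in order, first to last."""
    return lambda x: reduce(lambda value, layer: layer(value), layers, x)

# Hypothetical layers; each maps a list of floats to a list of floats.
def layer1(x):
    return [2.0 * v for v in x]             # f^(1): hidden layer of width 2

def layer2(x):
    return [max(0.0, v - 1.0) for v in x]   # f^(2): hidden layer of width 2

def layer3(x):
    return [sum(x)]                         # f^(3): output layer, one value

network = compose(layer1, layer2, layer3)   # a network of depth 3
output = network([1.0, 0.25])               # [1.0]
```

Here `network(x)` evaluates `layer3(layer2(layer1(x)))`, mirroring the nested formula.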

Training a feedforward network involves:

1. Choosing the **optimizer**, **loss function**, and output unit type.
2. Selecting **activation functions** for hidden layers.
3. Designing the **architecture**: number of layers, connectivity, and number of units per layer.

Feedforward networks are used primarily for **supervised learning** with non-sequential data. Recurrent networks handle sequential data, mapping variable-length sequences $X_k = \{x_1, \dots, x_k\}$ to outputs $y_k$.

## Single-layer Perceptron

A **perceptron** is the simplest feedforward network with **no hidden layers**, only input and output layers. Its output is:

$$
o = g(w \cdot x + b)
$$

where $g$ is the activation function (identity, sigmoid $\sigma(x) = (1 + e^{-x})^{-1}$, or hyperbolic tangent $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$).
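
The output formula can be sketched in a few lines of plain Python. The function name `perceptron_output`, the `ACTIVATIONS` table, and the example weights are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: (1 + e^(-z))^(-1)."""
    return 1.0 / (1.0 + math.exp(-z))

# The three activation functions named in the text.
ACTIVATIONS = {"identity": lambda z: z, "sigmoid": sigmoid, "tanh": math.tanh}

def perceptron_output(w, x, b, g):
    """Compute o = g(w . x + b) for a weight list w, input list x, and bias b."""
    h = sum(wi * xi for wi, xi in zip(w, x)) + b
    return g(h)

w, b = [0.5, -0.5], 0.0   # arbitrary example parameters
x = [1.0, 1.0]            # gives w . x + b = 0
outputs = {name: perceptron_output(w, x, b, g) for name, g in ACTIVATIONS.items()}
# identity -> 0.0, sigmoid -> 0.5, tanh -> 0.0
```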

**Learning:** The perceptron adjusts weights $w$ and bias $b$ to classify examples $(x, y)$.

![title](img/picture4.png)

### Error Function

A **loss function** quantifies how well the perceptron matches the target:

$$
E(X) = \frac{1}{2N} \sum_{i=1}^N (o_i - y_i)^2 = \frac{1}{2N} \sum_{i=1}^N (g(w \cdot x_i + b) - y_i)^2
$$

Minimizing $E(X)$ with respect to $w$ and $b$ brings the outputs $o_i$ closer to the targets $y_i$, which in turn tends to improve classification accuracy.
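
A small worked example may make the error function concrete. The helper name `mean_squared_error` and the sample outputs and targets are made up for illustration.

```python
def mean_squared_error(outputs, targets):
    """E(X) = 1/(2N) * sum_i (o_i - y_i)^2."""
    n = len(outputs)
    return sum((o - y) ** 2 for o, y in zip(outputs, targets)) / (2 * n)

outputs = [1.0, 0.0, 0.5, 0.5]   # hypothetical perceptron outputs o_i
targets = [1.0, 0.0, 1.0, 0.0]   # hypothetical labels y_i
error = mean_squared_error(outputs, targets)   # (0 + 0 + 0.25 + 0.25) / 8 = 0.0625
```

The two exact predictions contribute nothing, while the two undecided outputs of 0.5 each add 0.25 to the sum.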

### Delta Rule

Weights and bias are updated using **gradient descent**:

$$
w_{t+1} = w_t - \alpha \frac{\partial E(X)}{\partial w_t}, \quad
b_{t+1} = b_t - \alpha \frac{\partial E(X)}{\partial b_t}
$$

where $t$ indexes the update step and $\alpha$ is the **learning rate**. The **delta rule**, a special case of backpropagation, computes:

$$
\Delta w = \frac{\alpha}{N} \sum_{i=1}^N (y_i - o_i) g'(h_i) x_i, \quad
\Delta b = \frac{\alpha}{N} \sum_{i=1}^N (y_i - o_i) g'(h_i)
$$

with $h_i = w \cdot x_i + b$, $o_i = g(h_i)$.
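
The delta rule follows from the gradient descent update by the chain rule. A short derivation, using only the symbols already defined:

$$
\frac{\partial E(X)}{\partial w}
= \frac{1}{2N} \sum_{i=1}^N \frac{\partial}{\partial w} (o_i - y_i)^2
= \frac{1}{N} \sum_{i=1}^N (o_i - y_i) \frac{\partial o_i}{\partial h_i} \frac{\partial h_i}{\partial w}
= \frac{1}{N} \sum_{i=1}^N (o_i - y_i) g'(h_i) x_i
$$

Setting $\Delta w = -\alpha \frac{\partial E(X)}{\partial w}$ flips the sign of $(o_i - y_i)$ and gives the expression for $\Delta w$ above. The bias update is obtained the same way, with $\frac{\partial h_i}{\partial b} = 1$ in place of $\frac{\partial h_i}{\partial w} = x_i$.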

### Training Procedure

1. **Forward pass:** Compute $h_i$ and $o_i$ for all inputs.
2. **Backward pass:** Update $w$ and $b$ using the delta rule.
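
The two passes can be sketched as a short batch training loop. This is a minimal illustration, not a definitive implementation: it assumes a sigmoid activation (so $g'(h_i) = o_i(1 - o_i)$), zero initialization, and a hypothetical function name `train_perceptron`, and it uses the logical OR function as a linearly separable toy dataset.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_perceptron(X, y, alpha=1.0, epochs=2000):
    """Batch delta-rule training for a sigmoid perceptron; returns (w, b)."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0                  # zero initialization (an assumption)
    for _ in range(epochs):
        # Forward pass: h_i = w . x_i + b and o_i = g(h_i) for every example.
        h = [sum(wj * xj for wj, xj in zip(w, xi)) + b for xi in X]
        o = [sigmoid(hi) for hi in h]
        # Backward pass: (y_i - o_i) * g'(h_i), with g'(h_i) = o_i * (1 - o_i).
        delta = [(yi - oi) * oi * (1.0 - oi) for yi, oi in zip(y, o)]
        for j in range(dim):
            w[j] += alpha / n * sum(d * xi[j] for d, xi in zip(delta, X))
        b += alpha / n * sum(delta)
    return w, b

# Logical OR is linearly separable, so a single perceptron can learn it.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [0.0, 1.0, 1.0, 1.0]
w, b = train_perceptron(X, y)
# Threshold the outputs at 0.5 to obtain class predictions.
predictions = [round(sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)) for xi in X]
```

The learning rate and epoch count are arbitrary choices that happen to suit this small problem; other datasets would need their own tuning.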

## Limitations

Single-layer perceptrons are **linear classifiers**: they can classify data correctly only when the classes are linearly separable, that is, when a single hyperplane divides them. For example, the XOR function is not linearly separable:

![title](img/picture5.png)

Nonlinear functions require **multiple layers**, leading to **multi-layer perceptrons (MLPs)**.
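
A short argument shows why no single perceptron can compute XOR. Assume, for illustration, that the perceptron predicts class 1 when $h = w_1 x_1 + w_2 x_2 + b > 0$ and class 0 otherwise (this thresholding convention is an assumption; the text does not fix one). XOR then requires:

$$
\begin{aligned}
(0,0) \mapsto 0 &: \quad b \le 0 \\
(1,1) \mapsto 0 &: \quad w_1 + w_2 + b \le 0 \\
(1,0) \mapsto 1 &: \quad w_1 + b > 0 \\
(0,1) \mapsto 1 &: \quad w_2 + b > 0
\end{aligned}
$$

Adding the first two conditions gives $w_1 + w_2 + 2b \le 0$, while adding the last two gives $w_1 + w_2 + 2b > 0$. These contradict each other, so no choice of $w$ and $b$ satisfies all four.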

## Multi-layer Perceptron (MLP)

An MLP consists of multiple layers of perceptrons. Because it can learn **nonlinear functions**, it is suitable for both regression and classification.

### Layers

MLPs consist of:

* Input layer
* One or more hidden layers
* Output layer

![title](img/picture6.png)

Feedforward networks form a DAG; outputs of a layer depend only on the previous layer.

## Formal Definition

1. Hidden layers use activation $g$; output layer uses $g_0$.
2. Each perceptron in layer $l_k$ is **fully connected** to all perceptrons in $l_{k-1}$.
3. No connections exist between perceptrons in the same layer.

![title](img/picture7.png)

**Notation:**

* Scalars: $w_{ij}^k$, $b_i^k$, $h_i^k$, $o_i^k$, $r_k$
* Vectors: $w_i^k$, $o^k$

**Forward computation:**

1. Input layer: $o_i^0 = x_i$
2. Hidden layers $l_1, \dots, l_{m-1}$:

$$
h_i^k = w_i^k \cdot o^{k-1} + b_i^k, \quad o_i^k = g(h_i^k)
$$

3. Output layer:

$$
h_1^m = w_1^m \cdot o^{m-1} + b_1^m, \quad o = g_0(h_1^m)
$$
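
The forward computation can be sketched as a loop over layers. The function name `forward`, the `layers` data layout, the choice of $\tanh$ for $g$ and the sigmoid for $g_0$, and the hand-picked weights are all assumptions made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, layers, g=math.tanh, g0=sigmoid):
    """Forward pass through an MLP with a single output unit.

    `layers` is a list of (W, b) pairs, one per layer l_1..l_m, where W[i]
    is the weight vector w_i^k and b[i] is the bias b_i^k.
    """
    o = list(x)                                    # input layer: o^0 = x
    for k, (W, b) in enumerate(layers, start=1):
        # h_i^k = w_i^k . o^(k-1) + b_i^k for every unit i in layer k
        h = [sum(wij * oj for wij, oj in zip(Wi, o)) + bi for Wi, bi in zip(W, b)]
        act = g0 if k == len(layers) else g        # g_0 only on the output layer
        o = [act(hi) for hi in h]
    return o[0]

# Hypothetical 2-2-1 network with hand-picked weights.
layers = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),      # hidden layer l_1
    ([[1.0, -1.0]], [0.0]),                        # output layer l_2
]
out = forward([0.5, 0.5], layers)   # both hidden units see h = 0, so out = 0.5
```

Each layer reads only the output vector of the layer before it, which is what makes a single pass from input to output sufficient.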

## Training MLPs

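Training minimizes the mean squared error over the $N$ training examples, where $o_i$ now denotes the network output for example $x_i$: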

$$
E(X) = \frac{1}{2N} \sum_{i=1}^N (o_i - y_i)^2
$$

Gradient descent updates:

$$
\Delta w_{ij}^k = -\alpha \frac{\partial E(X)}{\partial w_{ij}^k}, \quad
\Delta b_i^k = -\alpha \frac{\partial E(X)}{\partial b_i^k}
$$

Backpropagation efficiently computes these derivatives by propagating gradients **backwards** from the output layer. Training proceeds in two phases:

1. **Forward pass:** Compute all $h_i^k$ and $o_i^k$.
2. **Backward pass:** Compute gradients and update weights and biases.
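
The two phases can be sketched for a network with one hidden layer and a single output unit. This is an illustrative sketch under stated assumptions: the sigmoid is used for both $g$ and $g_0$, the function names (`forward`, `error`, `gradients`, `gradient_step`) are hypothetical, and the initial parameters are arbitrary numbers rather than a recommended initialization.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, w2, b2):
    """Forward pass: return hidden outputs o1 and the network output o."""
    o1 = [sigmoid(sum(w * xj for w, xj in zip(Wi, x)) + bi) for Wi, bi in zip(W1, b1)]
    o = sigmoid(sum(w * oi for w, oi in zip(w2, o1)) + b2)
    return o1, o

def error(X, y, W1, b1, w2, b2):
    """E(X) = 1/(2N) * sum_i (o_i - y_i)^2."""
    return sum((forward(x, W1, b1, w2, b2)[1] - t) ** 2 for x, t in zip(X, y)) / (2 * len(X))

def gradients(X, y, W1, b1, w2, b2):
    """Backward pass: gradients of E(X) with respect to every parameter."""
    n = len(X)
    gW1 = [[0.0] * len(X[0]) for _ in W1]
    gb1 = [0.0] * len(b1)
    gw2 = [0.0] * len(w2)
    gb2 = 0.0
    for x, t in zip(X, y):
        o1, o = forward(x, W1, b1, w2, b2)
        d_out = (o - t) * o * (1.0 - o) / n            # dE/dh at the output unit
        gb2 += d_out
        for i, oi in enumerate(o1):
            gw2[i] += d_out * oi
            d_hid = d_out * w2[i] * oi * (1.0 - oi)    # error propagated backwards
            gb1[i] += d_hid
            for j, xj in enumerate(x):
                gW1[i][j] += d_hid * xj
    return gW1, gb1, gw2, gb2

def gradient_step(X, y, W1, b1, w2, b2, alpha=0.5):
    """One gradient descent update; returns the new parameters."""
    gW1, gb1, gw2, gb2 = gradients(X, y, W1, b1, w2, b2)
    W1 = [[w - alpha * g for w, g in zip(Wi, gi)] for Wi, gi in zip(W1, gW1)]
    b1 = [b - alpha * g for b, g in zip(b1, gb1)]
    w2 = [w - alpha * g for w, g in zip(w2, gw2)]
    return W1, b1, w2, b2 - alpha * gb2

# XOR data with arbitrary starting parameters for a 2-2-1 network.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [0.0, 1.0, 1.0, 0.0]
params = ([[0.5, -0.4], [-0.3, 0.8]], [0.1, -0.2], [0.7, -0.6], 0.05)
before = error(X, y, *params)
params = gradient_step(X, y, *params)
after = error(X, y, *params)    # one step should lower the error slightly
```

Repeating `gradient_step` many times is the full training loop. Whether it reaches a good XOR solution depends on the initialization and learning rate, so this sketch shows only a single update.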
