# Artificial Neural Networks

# 1. What Is an Artificial Neural Network?

An **artificial neural network (ANN)** is a function built by connecting simple computational units called *neurons*.
Mathematically, an ANN with layers $1,\dots,L$ computes a function
$$
f_\theta(x) = A_L \circ \phi_{L-1} \circ A_{L-1} \circ \cdots \circ \phi_1 \circ A_1(x),
$$
where each $A_i(x) = W_i x + b_i$ is an affine map, each $\phi_i$ is an activation function applied componentwise, and $\theta$ collects all weights $W_i$ and biases $b_i$.

Even though each neuron is simple, many neurons connected together can represent very complicated functions. This expressive power is a main reason ANNs underpin modern AI systems (vision, speech, robotics, etc.).



# 2. Online Learning

**Online learning** means learning from data that arrives one piece at a time:
$$(x_1,y_1), (x_2,y_2),\dots$$

At time $t$, an online algorithm updates its parameters $\theta_t$ using only the data seen so far, $(x_1,y_1),\dots,(x_t,y_t)$. Applications in which new data keeps arriving, such as speech recognition or financial prediction, fit this setting. In contrast, **batch learning** requires the whole dataset before training starts.

Online learning is the more realistic model in many real-world settings, and it is more efficient when a dataset is too large to process all at once.



# 3. Neurons

A biological neuron fires when its input exceeds a threshold. The simplest mathematical model replicating this idea uses the **step function**:

$$
H(x) =
\begin{cases}
1, & x\ge 0,\\
0, & x<0.
\end{cases}
$$

To give a neuron multiple inputs, we compute a *weighted sum* and then apply the activation:

$$
y = H(w\cdot x + b).
$$

The boundary between firing and not firing is the hyperplane
$$
w\cdot x + b = 0.
$$

This makes the neuron a **linear classifier**.
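The threshold neuron above can be sketched in a few lines of Python. The weights and bias below are hypothetical values chosen for illustration: they place the hyperplane $x_1 + x_2 - 1.5 = 0$ so that the neuron behaves like a logical AND on binary inputs.

```python
def step(x: float) -> int:
    """Heaviside step H(x): 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0


def neuron(w: list[float], b: float, x: list[float]) -> int:
    """Threshold neuron y = H(w . x + b)."""
    pre_activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return step(pre_activation)


# Hypothetical parameters: the hyperplane x1 + x2 - 1.5 = 0 separates (1, 1)
# from the other three corners of the unit square, so the neuron acts as AND.
w_and, b_and = [1.0, 1.0], -1.5
and_table = {(x1, x2): neuron(w_and, b_and, [x1, x2])
             for x1 in (0, 1) for x2 in (0, 1)}
print(and_table)
```

Only the input $(1,1)$ lies on the positive side of the hyperplane, so it is the only input for which the neuron fires.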

![title](img/picture.png)



# 4. The Sigmoid Function

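The step function is useful conceptually but too “hard” for learning: its derivative is zero everywhere except at $x=0$, where it is undefined, so it gives gradient-based methods no signal. A smoother alternative that behaves similarly is the **sigmoid**: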

$$
\sigma(x) = \frac{1}{1+e^{-x}}.
$$

It stays between $0$ and $1$ and transitions from low to high around $x=0$. It is differentiable, which is essential for gradient-based learning.
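A minimal sketch of the sigmoid and its derivative follows. It uses the standard identity $\sigma'(x) = \sigma(x)(1-\sigma(x))$, and the two-branch form is one common way to avoid overflow for large $|x|$; the function names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable sigma(x) = 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # x < 0 here, so exp(x) cannot overflow
    return e / (1.0 + e)


def sigmoid_prime(x: float) -> float:
    """Derivative sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-5.0, 0.0, 5.0):
    print(f"x={x:+.1f}  sigma={sigmoid(x):.4f}  sigma'={sigmoid_prime(x):.4f}")
```

The derivative is largest at $x=0$ and decays toward zero in both tails, which is where the sigmoid resembles the step function most closely.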

![title](img/picture2.png)



# 5. Viewing ANNs as Graphs

An ANN can be drawn as a **directed graph**:

* each node is a neuron,
* each directed edge carries a number (the output of another neuron),
* edges have weights.

If no cycles occur, the graph is a **directed acyclic graph (DAG)**, meaning computation flows forward only. These are **feedforward networks**.

If cycles are allowed, we obtain **recurrent neural networks (RNNs)**, used for sequence data.
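For the acyclic (feedforward) case, the graph view suggests a direct way to evaluate a network: visit the nodes in topological order, so every neuron is computed after its predecessors. The sketch below assumes a small hypothetical network (node names, weights, and biases are made up) and uses `graphlib.TopologicalSorter` from the Python standard library.

```python
import math
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical network: inputs "x1", "x2"; hidden neurons "h1", "h2"; output "y".
# incoming[node] maps each predecessor to the weight on the edge into node.
incoming = {
    "h1": {"x1": 1.0, "x2": -1.0},
    "h2": {"x1": 0.5, "x2": 0.5},
    "y": {"h1": 2.0, "h2": -1.0},
}
bias = {"h1": 0.0, "h2": 0.1, "y": -0.5}


def evaluate(inputs: dict[str, float]) -> dict[str, float]:
    """Evaluate every neuron once, in an order where predecessors come first."""
    graph = {node: set(preds) for node, preds in incoming.items()}
    value = dict(inputs)
    for node in TopologicalSorter(graph).static_order():
        if node in value:  # input nodes already carry a value
            continue
        total = sum(w * value[src] for src, w in incoming[node].items())
        value[node] = sigmoid(total + bias[node])
    return value


values = evaluate({"x1": 1.0, "x2": 0.0})
print(values["y"])
```

If the graph contained a cycle, `static_order()` would raise `graphlib.CycleError`, which reflects the fact that a single forward sweep is only defined for DAGs.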

![title](img/picture3.png)



# 6. Layers

Neurons that take their inputs from the same set of preceding neurons form a **layer**.
A feedforward ANN is therefore typically organized into:

* **input layer** $l_0$
* **hidden layers** $l_1, \dots, l_{L-1}$
* **output layer** $l_L$

Layers help us visualize how information flows through the network.

![title](img/picture5.png)



# 7. The Universal Approximation Theorem

One of the most important facts about neural networks is:

## Theorem (Universal Approximation, simplified)

Let $K\subset\mathbb{R}^n$ be compact and let $f:K\to\mathbb{R}$ be continuous.
Then for any $\varepsilon>0$ there exists a neural network $\hat f$ with **one hidden layer** of finitely many units, a sigmoidal activation, and a linear output unit such that
$$
|f(x) - \hat f(x)| < \varepsilon
\quad\text{for all } x\in K.
$$

In short: **neural networks with enough hidden units can approximate any continuous function on a compact set as closely as we want.**

This does *not* say learning is easy; it only establishes that the representation is powerful enough.
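To make the statement concrete, here is a hand-built one-hidden-layer network that approximates $\sin$ on $[0,\pi]$. This is an illustrative construction, not the proof of the theorem and not how networks are trained: each hidden unit is a steep sigmoid acting as a smooth step, and the output weights are the increments of $f$ on a grid. The function name `build_network` and the `sharpness` parameter are assumptions made for this sketch.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def build_network(f, a: float, b: float, n_hidden: int, sharpness: float = 20.0):
    """One hidden layer of n_hidden sigmoids plus a linear output unit.

    Hidden unit i computes sigmoid(k * (x - m_i)), a smooth step at the
    midpoint m_i of the i-th grid cell; its output weight is the change
    of f across that cell.
    """
    h = (b - a) / n_hidden
    k = sharpness / h  # steeper sigmoids for finer grids
    grid = [a + i * h for i in range(n_hidden + 1)]
    midpoints = [(grid[i] + grid[i + 1]) / 2 for i in range(n_hidden)]
    jumps = [f(grid[i + 1]) - f(grid[i]) for i in range(n_hidden)]
    offset = f(grid[0])

    def f_hat(x: float) -> float:
        hidden = (sigmoid(k * (x - m)) for m in midpoints)
        return offset + sum(c * s for c, s in zip(jumps, hidden))

    return f_hat


f_hat = build_network(math.sin, 0.0, math.pi, n_hidden=50)
xs = [math.pi * i / 1000 for i in range(1001)]
max_error = max(abs(math.sin(x) - f_hat(x)) for x in xs)
print(f"max |f - f_hat| on the test grid: {max_error:.4f}")
```

Increasing `n_hidden` refines the grid and shrinks the error, which mirrors the role of "enough hidden units" in the theorem.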



# 8. Training: Loss Functions

Given data $(x_i,y_i)$, we measure how well a network $f_\theta$ fits them via a **loss function**. A common choice is the empirical loss:

$$
E(\theta) = \frac{1}{N} \sum_{i=1}^N L(f_\theta(x_i), y_i).
$$

Typical losses:

* **MSE** for regression: $L(u,v)=|u-v|^2$
* **Cross-entropy** for classification

We want to adjust $\theta$ so that $E(\theta)$ is small.
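The two losses and the empirical average can be sketched as follows. Cross-entropy is written in its standard form $-\sum_k y_k \log p_k$ for a predicted probability vector $p$; the sample predictions and targets are hypothetical.

```python
import math


def mse(u: list[float], v: list[float]) -> float:
    """Squared error |u - v|^2 for one example."""
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))


def cross_entropy(p: list[float], y: list[float], eps: float = 1e-12) -> float:
    """Cross-entropy -sum_k y_k log p_k between predicted probabilities p
    and a one-hot (or probability) target y; eps guards against log(0)."""
    return -sum(yk * math.log(max(pk, eps)) for pk, yk in zip(p, y))


def empirical_loss(loss, predictions, targets) -> float:
    """E = (1/N) * sum_i loss(prediction_i, target_i)."""
    return sum(loss(p, y) for p, y in zip(predictions, targets)) / len(targets)


# Hypothetical predictions and targets, for illustration only.
reg_loss = empirical_loss(mse, [[1.0], [2.5]], [[1.0], [2.0]])
clf_loss = empirical_loss(cross_entropy, [[0.5, 0.5], [1.0, 0.0]], [[1, 0], [1, 0]])
print(reg_loss, clf_loss)
```

In this sketch the predictions are passed in directly; in training they would be the network outputs $f_\theta(x_i)$.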


# 9. Gradient Descent

Because $E(\theta)$ is usually too complicated to minimize exactly, we use **gradient descent**:

$$
\theta_{t+1} = \theta_t - \eta_t \nabla_\theta E(\theta_t),
$$

where $\eta_t$ is a learning rate.

Intuition: the gradient points in the direction of steepest increase, so a sufficiently small step in the opposite direction reduces the loss.

When data arrives one point at a time, the same idea yields **stochastic / online gradient descent**, which updates $\theta$ after each example using the gradient of that example's loss alone.
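The difference between the batch and online updates can be seen on a one-parameter model $f_\theta(x) = \theta x$ with squared loss. The data, learning rate, and step counts below are hypothetical choices for a minimal sketch.

```python
def grad_single(theta: float, x: float, y: float) -> float:
    """Gradient of the per-example loss (theta * x - y)^2 w.r.t. theta."""
    return 2.0 * (theta * x - y) * x


def batch_gradient_descent(data, theta=0.0, eta=0.05, steps=200):
    """Each step uses the gradient of the full empirical loss E(theta)."""
    for _ in range(steps):
        grad = sum(grad_single(theta, x, y) for x, y in data) / len(data)
        theta -= eta * grad
    return theta


def online_gradient_descent(stream, theta=0.0, eta=0.05):
    """One update per arriving example, using only that example."""
    for x, y in stream:
        theta -= eta * grad_single(theta, x, y)
    return theta


# Hypothetical noiseless data generated by y = 3x.
data = [(x, 3.0 * x) for x in (1.0, 2.0, -1.0, 0.5)]
theta_batch = batch_gradient_descent(data)
theta_online = online_gradient_descent(data * 50)  # replay the stream 50 times
print(theta_batch, theta_online)
```

Both variants approach $\theta = 3$ here because the data are noiseless; with noisy data the online iterates would typically fluctuate around the minimizer unless the learning rate is decreased over time.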



# 10. Backpropagation

**Backpropagation** is the algorithm that makes gradient descent feasible for deep networks. It applies the chain rule efficiently through the layers.

Key idea:

1. Compute the outputs forward.
2. Compute how changes in each neuron affect the final loss by propagating “error signals” backward.
3. Use these signals to compute gradients with respect to all weights and biases.

## Theorem (Correctness of Backpropagation, intuitive)

For any feedforward network with differentiable activations and a differentiable loss, backpropagation computes the exact gradient
$$
\nabla_\theta E(\theta)
$$
using, per training example, a number of operations proportional to the number of connections in the network.

This made training deep networks practical.

For recurrent networks, the same idea is applied to the **unfolded** network over time, yielding *Backpropagation Through Time (BPTT)*.



# 11. Putting It All Together

An ANN:

* takes in an input vector $x$,
* transforms it through layers of affine maps and activations,
* outputs a prediction $f_\theta(x)$,
* and learns by adjusting weights and biases to reduce a loss function.

Despite its simple components, the network can learn highly nonlinear functions, thanks to the universal approximation theorem, gradient-based optimization, and the expressive power of layered computation.


# Backpropagation

We consider a feedforward network with $L$ layers. Dimensions:

* Input: $a_0 = x \in \mathbb{R}^{n_0}$
* Layer $\ell$ pre-activation: $z_\ell \in \mathbb{R}^{n_\ell}$
* Layer $\ell$ activation: $a_\ell \in \mathbb{R}^{n_\ell}$
* Weights: $W_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$
* Biases: $b_\ell \in \mathbb{R}^{n_\ell}$

The forward propagation is

$$
z_\ell = W_\ell a_{\ell-1} + b_\ell,
\qquad
a_\ell = \phi_\ell(z_\ell),
$$

where $\phi_\ell$ acts elementwise.

We use the following notation:

* $\phi_\ell'(z_\ell)$ is the vector of elementwise derivatives.
* $\operatorname{diag}(\phi_\ell'(z_\ell))$ is the diagonal Jacobian matrix of $\phi_\ell$ at $z_\ell$.
* $\odot$ denotes elementwise (Hadamard) product.

# Forward Pass (Vectorized)

For each layer $\ell = 1,\dots,L$:

$$
z_\ell = W_\ell a_{\ell-1} + b_\ell,
\qquad
a_\ell = \phi_\ell(z_\ell).
$$

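If the final output is $a_L$ and the target is $y$, the loss is $L(a_L, y)$. Note the overloaded symbol: as a function, $L(\cdot,\cdot)$ is the loss; as an index, $L$ is the number of the last layer.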
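A vectorized forward pass might look like the following NumPy sketch. It stores every $z_\ell$ and $a_\ell$, since the backward pass needs them. The layer sizes, random parameters, and the choice of a sigmoid hidden layer with identity output are assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, weights, biases, activations):
    """Return the lists [z_1..z_L] and [a_0..a_L] for one input vector x."""
    a = x
    zs, activs = [], [x]
    for W, b, phi in zip(weights, biases, activations):
        z = W @ a + b      # z_l = W_l a_{l-1} + b_l
        a = phi(z)         # a_l = phi_l(z_l), applied elementwise
        zs.append(z)
        activs.append(a)
    return zs, activs


# Hypothetical layer sizes n_0 = 3, n_1 = 4, n_2 = 2 with random parameters.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in sizes[1:]]
activations = [sigmoid, lambda z: z]  # sigmoid hidden layer, identity output

zs, activs = forward(rng.standard_normal(3), weights, biases, activations)
print([a.shape for a in activs])
```

Each weight matrix has shape $n_\ell \times n_{\ell-1}$, matching the dimensions listed earlier, so the activations have lengths $n_0, n_1, n_2$ in turn.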

# Error Signal

Define the **error signal** at layer $\ell$:

$$
\delta_\ell := \frac{\partial L}{\partial z_\ell} \in \mathbb{R}^{n_\ell}.
$$

This is central because

$$
\frac{\partial L}{\partial W_\ell} = \delta_\ell a_{\ell-1}^\top,
\qquad
\frac{\partial L}{\partial b_\ell} = \delta_\ell.
$$

Once $\delta_\ell$ is known, computing gradients is straightforward.

# Output Layer

For the output layer $\ell = L$:

$$
\delta_L = \frac{\partial L}{\partial a_L} \odot \phi_L'(z_L).
$$

Equivalently, in Jacobian form:

$$
\delta_L = \operatorname{diag}(\phi_L'(z_L)) \frac{\partial L}{\partial a_L}.
$$

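This holds for any differentiable loss, provided $\phi_L$ acts elementwise. An output activation that couples components, such as softmax, requires its full Jacobian in place of the diagonal matrix.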
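As a worked instance of the output-layer formula, assume (for illustration only) a sigmoid output activation $\phi_L = \sigma$ and the binary cross-entropy loss, with $a_{L,j}$ denoting the $j$-th component of $a_L$:

$$
L(a_L, y) = -\sum_j \Big[ y_j \log a_{L,j} + (1-y_j)\log(1-a_{L,j}) \Big].
$$

Then

$$
\frac{\partial L}{\partial a_{L,j}} = \frac{a_{L,j} - y_j}{a_{L,j}(1-a_{L,j})},
\qquad
\sigma'(z_{L,j}) = a_{L,j}(1-a_{L,j}),
$$

and the two factors cancel in the Hadamard product:

$$
\delta_L = \frac{\partial L}{\partial a_L} \odot \sigma'(z_L) = a_L - y.
$$

For comparison, the squared loss $L(a_L,y) = |a_L - y|^2$ with the same activation gives $\delta_L = 2(a_L - y) \odot a_L \odot (1 - a_L)$, where the factor $\sigma'(z_L)$ does not cancel.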

# Hidden Layers

For hidden layers $\ell = L-1,\dots,1$:

$$
\delta_\ell = (W_{\ell+1}^\top \delta_{\ell+1}) \odot \phi_\ell'(z_\ell).
$$

Or equivalently in full matrix form:

$$
\delta_\ell = \operatorname{diag}(\phi_\ell'(z_\ell)) W_{\ell+1}^\top \delta_{\ell+1}.
$$

# Gradients with Respect to Parameters

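Differentiating $z_\ell = W_\ell a_{\ell-1} + b_\ell$ componentwise (the $i$-th entry of $z_\ell$ depends only on the $i$-th row of $W_\ell$) gives

$$
\frac{\partial z_{\ell,i}}{\partial W_{\ell,ij}} = a_{\ell-1,j},
\qquad
\frac{\partial z_\ell}{\partial b_\ell} = I.
$$

Then, by the chain rule with $\delta_\ell = \partial L / \partial z_\ell$,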

$$
\frac{\partial L}{\partial W_\ell} = \delta_\ell a_{\ell-1}^\top,
\qquad
\frac{\partial L}{\partial b_\ell} = \delta_\ell.
$$

# Backpropagation Summary

**Forward pass** ($\ell = 1,\dots,L$):

$$
z_\ell = W_\ell a_{\ell-1} + b_\ell,
\qquad
a_\ell = \phi_\ell(z_\ell).
$$

**Backward pass**:

$$
\delta_L = \frac{\partial L}{\partial a_L} \odot \phi_L'(z_L),
\qquad
\delta_\ell = (W_{\ell+1}^\top \delta_{\ell+1}) \odot \phi_\ell'(z_\ell), \ \ell = L-1,\dots,1.
$$

**Gradients**:

$$
\frac{\partial L}{\partial W_\ell} = \delta_\ell a_{\ell-1}^\top,
\qquad
\frac{\partial L}{\partial b_\ell} = \delta_\ell.
$$

Everything follows from standard multivariable calculus and matrix differentials.
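The summary above can be turned into a short NumPy sketch and checked against a finite-difference approximation. The assumptions here are a sigmoid activation in every layer, the squared loss $|a_L - y|^2$, and hypothetical layer sizes with random parameters. Python lists are 0-indexed, so `weights[l]` plays the role of $W_{\ell+1}$ while `activs[l]` is $a_\ell$.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss(a_out, y):
    """Squared error |a_L - y|^2."""
    return float(np.sum((a_out - y) ** 2))


def forward(x, weights, biases):
    """Sigmoid activation in every layer; returns all z_l and a_l."""
    zs, activs = [], [x]
    for W, b in zip(weights, biases):
        zs.append(W @ activs[-1] + b)
        activs.append(sigmoid(zs[-1]))
    return zs, activs


def backprop(x, y, weights, biases):
    """Gradients of the loss w.r.t. every W_l and b_l for one example."""
    zs, activs = forward(x, weights, biases)
    # Output layer: dL/da_L = 2 (a_L - y), and sigma'(z) = a (1 - a).
    delta = 2.0 * (activs[-1] - y) * activs[-1] * (1.0 - activs[-1])
    grads_W, grads_b = [], []
    for l in range(len(weights) - 1, -1, -1):
        grads_W.insert(0, np.outer(delta, activs[l]))  # delta a_{prev}^T
        grads_b.insert(0, delta)
        if l > 0:  # propagate the error signal to the previous layer
            a_prev = activs[l]
            delta = (weights[l].T @ delta) * a_prev * (1.0 - a_prev)
    return grads_W, grads_b


rng = np.random.default_rng(0)
sizes = [3, 4, 2]  # hypothetical layer widths
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]
x, y = rng.standard_normal(3), rng.standard_normal(2)

grads_W, grads_b = backprop(x, y, weights, biases)

# Central finite difference for one entry of the first weight matrix.
eps = 1e-6
W_plus = [W.copy() for W in weights]
W_minus = [W.copy() for W in weights]
W_plus[0][1, 2] += eps
W_minus[0][1, 2] -= eps
numeric = (loss(forward(x, W_plus, biases)[1][-1], y)
           - loss(forward(x, W_minus, biases)[1][-1], y)) / (2 * eps)
print(abs(numeric - grads_W[0][1, 2]))
```

A finite-difference check like this costs two forward passes per parameter, so it is useful for testing an implementation but not for training; backpropagation obtains all gradients from one forward and one backward pass.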

# Gradient Descent and Convergence

Consider gradient descent on a differentiable function $f:\mathbb{R}^p \to \mathbb{R}$:

$$
\theta_{k+1} = \theta_k - \eta \nabla f(\theta_k).
$$

Assume:

* $f$ is convex.
* $\nabla f$ is $L$-Lipschitz: $|\nabla f(u) - \nabla f(v)| \le L|u-v|$.
* Learning rate $0 < \eta < 2/L$.

## Monotone Decrease

By $L$-smoothness:

$$
f(v) \le f(u) + \nabla f(u)^\top (v-u) + \frac{L}{2}|v-u|^2.
$$

Apply $u = \theta_k$, $v = \theta_{k+1} = \theta_k - \eta \nabla f(\theta_k)$:

$$
f(\theta_{k+1}) \le f(\theta_k) - \left(\eta - \frac{L \eta^2}{2}\right) |\nabla f(\theta_k)|^2.
$$

Since $0 < \eta < 2/L$, the coefficient is positive, so $f(\theta_{k+1}) \le f(\theta_k)$.

## Convergence Rate

For any minimizer $\theta^*$, convexity gives:

$$
f(\theta_k) - f(\theta^*) \le \nabla f(\theta_k)^\top (\theta_k - \theta^*).
$$

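Assume in addition that $\eta \le 1/L$ (the bound below can fail for step sizes between $1/L$ and $2/L$). Then $\eta - \frac{L\eta^2}{2} \ge \frac{\eta}{2}$, so the descent inequality gives $f(\theta_{k+1}) \le f(\theta_k) - \frac{\eta}{2}|\nabla f(\theta_k)|^2$. Combining this with the convexity inequality and expanding $|\theta_{k+1} - \theta^*|^2 = |\theta_k - \eta\nabla f(\theta_k) - \theta^*|^2$ yields

$$
f(\theta_{k+1}) - f(\theta^*) \le \frac{1}{2\eta}\left(|\theta_k - \theta^*|^2 - |\theta_{k+1} - \theta^*|^2\right).
$$

Summing over the first $k$ iterations telescopes the right-hand side. Since the values $f(\theta_1),\dots,f(\theta_k)$ are non-increasing, $f(\theta_k)$ is no larger than their average, and therefore

$$
f(\theta_k) - f(\theta^*) \le \frac{|\theta_0 - \theta^*|^2}{2\eta k},
$$

so $f(\theta_k) \to f(\theta^*)$ as $k \to \infty$ (sublinear rate $O(1/k)$).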
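Both properties can be checked numerically on a simple convex quadratic. The function, starting point, and iteration count below are hypothetical; the step size is set to $\eta = 1/L$, which lies inside the range assumed above.

```python
def f(theta):
    """Convex quadratic f(theta) = 0.5 * (theta_1^2 + 10 * theta_2^2)."""
    return 0.5 * (theta[0] ** 2 + 10.0 * theta[1] ** 2)


def grad_f(theta):
    return [theta[0], 10.0 * theta[1]]


L_smooth = 10.0        # largest eigenvalue of the Hessian diag(1, 10)
eta = 1.0 / L_smooth   # step size used for this check
theta0 = [5.0, -3.0]
theta_star, f_star = [0.0, 0.0], 0.0
dist0_sq = sum((a - b) ** 2 for a, b in zip(theta0, theta_star))

theta, values = list(theta0), [f(theta0)]
for k in range(1, 101):
    g = grad_f(theta)
    theta = [t - eta * gi for t, gi in zip(theta, g)]
    values.append(f(theta))  # values[k] = f(theta_k)

monotone = all(values[k + 1] <= values[k] for k in range(100))
within_bound = all(values[k] - f_star <= dist0_sq / (2 * eta * k)
                   for k in range(1, 101))
print(monotone, within_bound)
```

On this example the iterates actually converge geometrically, much faster than $O(1/k)$; the bound is a worst-case guarantee over all convex $L$-smooth functions.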

# Remarks for Neural Networks

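* Neural network losses are generally not convex, so convergence to a global minimum is not guaranteed.
* The monotone-decrease argument does not use convexity: if $\nabla E$ is $L$-Lipschitz, full-batch gradient descent with $0 < \eta < 2/L$ never increases the loss. Stochastic updates are not covered by this guarantee.
* Empirically, large neural networks are often observed to have many local minima with similar loss values, and saddle points are thought to be far more numerous than poor local minima.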
