# Artificial Neural Networks

# 1. What Is an Artificial Neural Network?

An **artificial neural network (ANN)** is a function built by connecting simple computational units called *neurons*.
Mathematically, an ANN with layers $1,\dots,L$ computes a function
$$
f_\theta(x) = A_L \circ \phi_{L-1} \circ A_{L-1} \circ \cdots \circ \phi_1 \circ A_1(x),
$$
where each $A_i$ is an affine map, and each $\phi_i$ is an activation applied componentwise.
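The composition above can be sketched directly in code. This is a minimal illustration, not a reference implementation: the helper names `affine`, `componentwise`, and `compose`, the choice of `tanh` as activation, and the weights are all assumptions made for the example.

```python
import math

def affine(W, b):
    """Return the affine map x -> W x + b for a weight matrix W (list of rows) and bias b."""
    def apply(x):
        return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
                for row, b_i in zip(W, b)]
    return apply

def componentwise(phi):
    """Lift a scalar activation phi so that it acts on each component of a vector."""
    return lambda v: [phi(v_i) for v_i in v]

def compose(*maps):
    """Compose maps right to left, so compose(f, g)(x) == f(g(x))."""
    def apply(x):
        for m in reversed(maps):
            x = m(x)
        return x
    return apply

# Hypothetical 2-2-1 network: f = A_2 after phi_1 after A_1, with phi_1 = tanh.
A1 = affine([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
A2 = affine([[1.0, -1.0]], [0.5])
f = compose(A2, componentwise(math.tanh), A1)

print(f([0.0, 0.0]))  # [0.5]
```

Writing the network as a composition of small maps mirrors the formula term by term; practical libraries usually store the weights as arrays instead.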

Although each neuron is simple, many neurons connected together can represent very complicated functions. This expressive power is the main reason ANNs underpin modern AI applications such as vision, speech, and robotics.



# 2. Online Learning

**Online learning** means learning from data that arrives one piece at a time:
$$(x_1,y_1), (x_2,y_2),\dots$$

At time $t$ an online algorithm updates its parameters $\theta_t$ using only the data seen so far. Systems such as speech recognition or financial prediction often operate this way. In contrast, **batch learning** requires all data to be available at once.

Online learning is more realistic in many real-world settings and more efficient when datasets are huge.



# 3. Neurons

A biological neuron fires when its input exceeds a threshold. The simplest mathematical model replicating this idea uses the **step function**:

$$
H(x) =
\begin{cases}
1, & x\ge 0,\\
0, & x<0.
\end{cases}
$$

To give a neuron multiple inputs, we compute a *weighted sum* and then apply the activation:

$$
y = H(w\cdot x + b).
$$

The boundary between firing and not firing is the hyperplane
$$
w\cdot x + b = 0.
$$

This makes the neuron a **linear classifier**.
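As a small sketch of the threshold neuron, the code below evaluates $y = H(w\cdot x + b)$ on four points. The weights and bias are hypothetical values chosen so that the decision boundary $x_1 + x_2 - 1.5 = 0$ separates $(1,1)$ from the other three inputs, which makes the neuron behave like a logical AND.

```python
def step(u):
    """Heaviside step function H(u): 1 if u >= 0, else 0."""
    return 1 if u >= 0 else 0

def neuron(w, b, x):
    """Threshold neuron y = H(w . x + b)."""
    return step(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)

# Hypothetical parameters: the boundary is the line x1 + x2 - 1.5 = 0.
w, b = [1.0, 1.0], -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, neuron(w, b, x))
```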

![title](img/picture.png)



# 4. The Sigmoid Function

The step function is useful conceptually but too “hard” for learning. A smoother alternative that behaves similarly is the **sigmoid**:

$$
\sigma(x) = \frac{1}{1+e^{-x}}.
$$

It stays between $0$ and $1$ and transitions from low to high around $x=0$. It is differentiable, which is essential for gradient-based learning.
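The following sketch evaluates the sigmoid and its derivative, using the identity $\sigma'(x) = \sigma(x)(1-\sigma(x))$, and compares the derivative with a finite difference. The branch on the sign of `x` is an implementation choice made here to avoid overflow in `math.exp`; it is not part of the definition.

```python
import math

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x)), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_prime(x):
    """Derivative of the sigmoid via sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Compare the closed-form derivative with a central finite difference at x = 0.7.
h = 1e-6
numeric = (sigmoid(0.7 + h) - sigmoid(0.7 - h)) / (2 * h)
print(sigmoid(0.0), sigmoid_prime(0.0))  # 0.5 0.25
print(abs(numeric - sigmoid_prime(0.7)) < 1e-8)
```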

![title](img/picture2.png)



# 5. Viewing ANNs as Graphs

An ANN can be drawn as a **directed graph**:

* each node is a neuron,
* each directed edge carries a number (the output of another neuron),
* edges have weights.

If no cycles occur, the graph is a **directed acyclic graph (DAG)**, meaning computation flows forward only. These are **feedforward networks**.

If cycles are allowed, we obtain **recurrent neural networks (RNNs)**, used for sequence data.

![title](img/picture3.png)



# 6. Layers

Neurons that depend on the same preceding computations form a **layer**.
Thus an ANN is typically grouped as:

* **input layer** $l_0$
* **hidden layers** $l_1, \dots, l_{L-1}$
* **output layer** $l_L$

Layers help us visualize how information flows through the network.

![title](img/picture5.png)



# 7. The Universal Approximation Theorem

One of the most important facts about neural networks is:

## Theorem (Universal Approximation, simplified)

Let $K\subset\mathbb{R}^n$ be compact and let $f:K\to\mathbb{R}$ be continuous.
Then for any $\varepsilon>0$ there exists a neural network with **one hidden layer** and a sigmoidal activation, computing a function $\hat f$, such that
$$
|f(x) - \hat f(x)| < \varepsilon
\quad\text{for all } x\in K.
$$

In short: **neural networks with enough hidden units can approximate any continuous function as closely as we want.**

This does *not* say learning is easy; it only establishes that the representation is powerful enough.
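For intuition only (this is not a proof), the sketch below builds a one-hidden-layer sigmoid network in one dimension by hand. It assumes a target $\sin(2\pi t)$ on $[0,1]$ and uses pairs of steep sigmoids to form "bumps", which gives a nearly piecewise-constant approximation. The function names, the number of pieces, and the sharpness values are illustrative choices.

```python
import math

def sigmoid(x):
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def bump(x, left, right, sharpness):
    """Difference of two hidden sigmoid units: close to 1 on (left, right), close to 0 elsewhere."""
    return sigmoid(sharpness * (x - left)) - sigmoid(sharpness * (x - right))

def approximate(f, x, pieces=50):
    """One hidden layer of 2 * pieces sigmoid units with a linear output.

    Each bump on [0, 1] is scaled by the value of f at the bump's midpoint.
    """
    width = 1.0 / pieces
    return sum(f((i + 0.5) * width) * bump(x, i * width, (i + 1) * width, 50.0 * pieces)
               for i in range(pieces))

target = lambda t: math.sin(2 * math.pi * t)
# Worst error on a grid in the interior of [0, 1].
worst = max(abs(approximate(target, x / 1000) - target(x / 1000)) for x in range(50, 951))
print(worst)
```

Increasing `pieces` shrinks the error at the cost of more hidden units, which matches the "enough hidden units" reading of the theorem.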



# 8. Training: Loss Functions

Given data $(x_i,y_i)$, we measure how well a network $f_\theta$ fits them via a **loss function**. A common choice is the empirical loss:

$$
E(\theta) = \frac{1}{N} \sum_{i=1}^N L(f_\theta(x_i), y_i).
$$

Typical losses:

* **MSE** for regression: $L(u,v)=|u-v|^2$
* **Cross-entropy** for classification

We want to adjust $\theta$ so that $E(\theta)$ is small.
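A minimal sketch of the empirical loss with the two per-example losses mentioned above. The binary form of cross-entropy, the clipping constant `eps`, and the toy model and data are assumptions made for illustration.

```python
import math

def mse(u, v):
    """Squared error L(u, v) = |u - v|^2 for a scalar prediction u and target v."""
    return (u - v) ** 2

def binary_cross_entropy(p, y, eps=1e-12):
    """Cross-entropy for a predicted probability p in (0, 1) and a label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def empirical_loss(model, loss, data):
    """E(theta) = (1/N) * sum_i L(f_theta(x_i), y_i) for a list of (x_i, y_i) pairs."""
    return sum(loss(model(x), y) for x, y in data) / len(data)

# Hypothetical model and data, purely for illustration.
model = lambda x: 2.0 * x
data = [(0.0, 0.0), (1.0, 1.0), (2.0, 5.0)]
print(empirical_loss(model, mse, data))  # (0 + 1 + 1) / 3
```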


# 9. Gradient Descent

Because $E(\theta)$ is usually too complicated to minimize exactly, we use **gradient descent**:

$$
\theta_{t+1} = \theta_t - \eta_t \nabla_\theta E(\theta_t),
$$

where $\eta_t$ is a learning rate.

Intuition: the gradient points in the direction of steepest ascent, so a sufficiently small step in the opposite direction reduces the loss.

When data arrives one point at a time, the same idea yields **stochastic / online gradient descent**, which adjusts $\theta$ gradually and continuously.
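The online variant can be sketched as a loop that sees each example once and then discards it. The example assumes a scalar linear model $f_\theta(x) = wx + b$ with squared loss, a constant learning rate, and a noiseless synthetic stream; all of these are illustrative choices.

```python
def online_sgd(stream, theta, eta=0.1):
    """Online gradient descent for f_theta(x) = w * x + b with squared loss.

    Each (x, y) pair is used once, as it arrives, and then discarded.
    """
    w, b = theta
    for x, y in stream:
        err = (w * x + b) - y                   # residual of the current prediction
        grad_w, grad_b = 2 * err * x, 2 * err   # gradient of (w * x + b - y)^2
        w, b = w - eta * grad_w, b - eta * grad_b
    return w, b

# Hypothetical noiseless stream generated by y = 3x - 1, with x cycling over a few values.
stream = ((x, 3 * x - 1) for x in [0.0, 0.5, 1.0, -0.5, -1.0] * 200)
w, b = online_sgd(stream, theta=(0.0, 0.0))
print(round(w, 3), round(b, 3))
```

Because the stream is a generator, the full dataset is never held in memory, which is the practical appeal of the online setting.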



# 10. Backpropagation

**Backpropagation** is the algorithm that makes gradient descent feasible for deep networks. It applies the chain rule efficiently through the layers.

Key idea:

1. Compute the outputs forward.
2. Compute how changes in each neuron affect the final loss by propagating “error signals” backward.
3. Use these signals to compute gradients with respect to all weights and biases.

## Theorem (Correctness of Backpropagation, intuitive)

For any feedforward network with differentiable activations, backpropagation computes the true gradient
$$
\nabla_\theta E(\theta)
$$
using a number of operations proportional to the number of connections in the network.

This made training deep networks practical.

For recurrent networks, the same idea is applied to the **unfolded** network over time, yielding *Backpropagation Through Time (BPTT)*.



# 11. Putting It All Together

An ANN:

* takes in an input vector $x$,
* transforms it through layers of affine maps and activations,
* outputs a prediction $f_\theta(x)$,
* and learns by adjusting weights and biases to reduce a loss function.

Despite its simple components, the network can learn highly nonlinear functions, thanks to the universal approximation theorem, gradient-based optimization, and the expressive power of layered computation.


# Backpropagation

We consider a feedforward network with $L$ layers. Dimensions:

* Input: $a_0 = x \in \mathbb{R}^{n_0}$
* Layer $\ell$ pre-activation: $z_\ell \in \mathbb{R}^{n_\ell}$
* Layer $\ell$ activation: $a_\ell \in \mathbb{R}^{n_\ell}$
* Weights: $W_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$
* Biases: $b_\ell \in \mathbb{R}^{n_\ell}$

Forward propagation:

$$
z_\ell = W_\ell a_{\ell-1} + b_\ell,
\qquad
a_\ell = \phi_\ell(z_\ell)
$$

Notation:

* $\phi_\ell'(z_\ell)$: elementwise derivative of the activation
* $\operatorname{diag}(\phi_\ell'(z_\ell))$: the corresponding diagonal Jacobian
* $\odot$: elementwise product

# Forward Pass (Vectorized)

For $\ell=1,\dots,L$:

$$
z_\ell = W_\ell a_{\ell-1} + b_\ell,
\qquad
a_\ell = \phi_\ell(z_\ell)
$$

Loss: $L(a_L,y)$.

# Error Signal

Define

$$
\delta_\ell := \frac{\partial L}{\partial z_\ell}
$$

Then

$$
\frac{\partial L}{\partial W_\ell} = \delta_\ell a_{\ell-1}^\top,
\qquad
\frac{\partial L}{\partial b_\ell} = \delta_\ell
$$

# Output Layer

$$
\delta_L
= \frac{\partial L}{\partial a_L}\odot \phi_L'(z_L)
= \operatorname{diag}(\phi_L'(z_L))\frac{\partial L}{\partial a_L}
$$

# Hidden Layers

For $\ell=L-1,\dots,1$:

$$
\delta_\ell
= (W_{\ell+1}^\top \delta_{\ell+1}) \odot \phi_\ell'(z_\ell)
= \operatorname{diag}(\phi_\ell'(z_\ell))W_{\ell+1}^\top\delta_{\ell+1}
$$

# Parameter Gradients

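Since $z_\ell = W_\ell a_{\ell-1} + b_\ell$, the entries of $z_\ell$ depend on the parameters through

$$
\frac{\partial z_{\ell,i}}{\partial W_{\ell,ij}}=a_{\ell-1,j},
\qquad
\frac{\partial z_\ell}{\partial b_\ell}=I
$$

Thus, by the chain rule,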

$$
\frac{\partial L}{\partial W_\ell}=\delta_\ell a_{\ell-1}^\top,
\qquad
\frac{\partial L}{\partial b_\ell}=\delta_\ell
$$

# Backpropagation Summary

Forward:

$$
z_\ell=W_\ell a_{\ell-1}+b_\ell,\qquad a_\ell=\phi_\ell(z_\ell).
$$

Backward:

$$
\delta_L=\frac{\partial L}{\partial a_L}\odot \phi_L'(z_L),
\qquad
\delta_\ell=(W_{\ell+1}^\top\delta_{\ell+1})\odot \phi_\ell'(z_\ell)
$$

Gradients:

$$
\frac{\partial L}{\partial W_\ell}=\delta_\ell a_{\ell-1}^\top,
\qquad
\frac{\partial L}{\partial b_\ell}=\delta_\ell
$$
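The forward, backward, and gradient formulas can be sketched in plain Python. This sketch assumes sigmoid activations in every layer and the squared-error loss $L = \tfrac12|a_L - y|^2$; both are choices made for the example, as are the function names and the 2-3-1 test network. One gradient entry is compared with a central finite difference.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Return the pre-activations z_l and the activations a_l (with a_0 = x)."""
    zs, activations = [], [x]
    for W, b in params:
        z = [sum(w * a for w, a in zip(row, activations[-1])) + b_i
             for row, b_i in zip(W, b)]
        zs.append(z)
        activations.append([sigmoid(z_i) for z_i in z])
    return zs, activations

def loss(params, x, y):
    """Squared-error loss 0.5 * |a_L - y|^2 (an assumption of this sketch)."""
    _, activations = forward(params, x)
    return 0.5 * sum((a - t) ** 2 for a, t in zip(activations[-1], y))

def backprop(params, x, y):
    """Return [(dL/dW_l, dL/db_l)] for every layer, first layer first."""
    _, activations = forward(params, x)
    # Output layer: delta_L = dL/da_L times phi'(z_L); for the sigmoid, phi' = a * (1 - a).
    delta = [(a - t) * a * (1 - a) for a, t in zip(activations[-1], y)]
    grads = []
    for l in range(len(params) - 1, -1, -1):
        a_prev = activations[l]
        # dL/dW = delta a_prev^T (outer product), dL/db = delta.
        grads.append(([[d * a for a in a_prev] for d in delta], list(delta)))
        if l > 0:
            W = params[l][0]
            # Hidden layer: delta_prev = (W^T delta) times phi'(z_prev), elementwise.
            back = [sum(W[i][j] * delta[i] for i in range(len(delta)))
                    for j in range(len(a_prev))]
            delta = [g * a * (1 - a) for g, a in zip(back, a_prev)]
    return grads[::-1]

# Hypothetical 2-3-1 network and a finite-difference check of one weight gradient.
params = [([[0.1, -0.2], [0.4, 0.5], [-0.3, 0.2]], [0.0, 0.1, -0.1]),
          ([[0.3, -0.4, 0.2]], [0.2])]
x, y = [1.0, 2.0], [1.0]
grads = backprop(params, x, y)

h = 1e-6
params[0][0][1][0] += h
up = loss(params, x, y)
params[0][0][1][0] -= 2 * h
down = loss(params, x, y)
params[0][0][1][0] += h          # restore the original weight
print(grads[0][0][1][0], (up - down) / (2 * h))
```

The backward loop touches each weight a constant number of times, which is the source of the cost proportional to the number of connections.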

# Gradient Descent and Convergence

We study gradient descent on a differentiable function $f:\mathbb{R}^p \to \mathbb{R}$:

$$
\theta_{k+1} = \theta_k - \eta \nabla f(\theta_k).
$$

Assume:

* $f$ is convex
* $\nabla f$ is $L$-Lipschitz (smooth):
  $$
  |\nabla f(u) - \nabla f(v)| \le L |u-v|
  $$
  equivalently
  $$
  f(v) \le f(u) + \nabla f(u)^\top (v-u) + \frac{L}{2}|v-u|^2
  $$
* learning rate $0 < \eta < 2/L$

These are the standard assumptions for convergence of gradient descent.



# Monotone Decrease

Apply smoothness with
$u=\theta_k$,
$v=\theta_{k+1} = \theta_k - \eta\nabla f(\theta_k)$:

$$
f(\theta_{k+1})
\le
f(\theta_k) - \eta |\nabla f(\theta_k)|^2 + \frac{L}{2}\eta^2 |\nabla f(\theta_k)|^2.
$$

Factor:

$$
f(\theta_{k+1})
\le
f(\theta_k)
-
\left(\eta - \frac{L\eta^2}{2}\right)
|\nabla f(\theta_k)|^2.
$$

Because $0<\eta<2/L$, the coefficient is positive, so

$$
f(\theta_{k+1}) \le f(\theta_k),
$$

with strict inequality unless $\nabla f(\theta_k)=0$.

Gradient descent is therefore a **descent method**.
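A quick numerical illustration of the step-size condition, under the assumption of a simple quadratic test function with $L = 10$ (the function, starting point, and step counts are hypothetical):

```python
def gradient_descent(grad, theta, eta, steps):
    """Run plain gradient descent and return all iterates, starting point included."""
    iterates = [theta]
    for _ in range(steps):
        theta = [t - eta * g for t, g in zip(theta, grad(theta))]
        iterates.append(theta)
    return iterates

# Hypothetical test function f(theta) = 0.5 * (theta_1^2 + 10 * theta_2^2).
# Its gradient (theta_1, 10 * theta_2) is L-Lipschitz with L = 10.
f = lambda th: 0.5 * (th[0] ** 2 + 10.0 * th[1] ** 2)
grad = lambda th: [th[0], 10.0 * th[1]]
L = 10.0

good = [f(th) for th in gradient_descent(grad, [1.0, 1.0], eta=1.9 / L, steps=50)]
bad = [f(th) for th in gradient_descent(grad, [1.0, 1.0], eta=2.1 / L, steps=50)]

print(all(b <= a for a, b in zip(good, good[1:])))  # True: eta < 2/L
print(bad[-1] > bad[0])                             # True: eta > 2/L diverges here
```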



# Bound on Gradient Norms

From the same inequality:

$$
f(\theta_k)-f(\theta_{k+1})
\ge
\left(\eta - \frac{L\eta^2}{2}\right)|\nabla f(\theta_k)|^2.
$$

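Summing over $k=0,\dots,T-1$, the left-hand side telescopes to $f(\theta_0)-f(\theta_T)$, which is at most $f(\theta_0)-f(\theta^*)$ for a minimizer $\theta^*$ of $f$: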

$$
\sum_{k=0}^{T-1}|\nabla f(\theta_k)|^2
\le
\frac{f(\theta_0)-f(\theta^*)}
{\eta - \frac{L\eta^2}{2}}.
$$

Thus the smallest gradient norm up to step $T$ satisfies

$$
\min_{k<T} |\nabla f(\theta_k)|^2
\le
\frac{f(\theta_0)-f(\theta^*)}{T\left(\eta - \frac{L\eta^2}{2}\right)},
$$

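so the smallest gradient norm among the first $k$ iterates decays at rate $O(1/\sqrt{k})$, and its square decays as $O(1/k)$. Because the sum of squared norms is bounded uniformly in $T$, it also follows that $|\nabla f(\theta_k)| \to 0$.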
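The two bounds on the gradient norms can be checked numerically. The sketch assumes a quadratic test function with $L = 10$ and minimum value $f(\theta^*) = 0$, a step size $\eta = 0.05$, and $T = 100$ steps; these values are illustrative.

```python
# Hypothetical test function f(theta) = 0.5 * (theta_1^2 + 10 * theta_2^2): L = 10, f(theta*) = 0.
f = lambda th: 0.5 * (th[0] ** 2 + 10.0 * th[1] ** 2)
grad = lambda th: [th[0], 10.0 * th[1]]
L, eta, T = 10.0, 0.05, 100

theta = [1.0, 1.0]
f0 = f(theta)
sq_norms = []
for _ in range(T):
    g = grad(theta)
    sq_norms.append(g[0] ** 2 + g[1] ** 2)      # |grad f(theta_k)|^2
    theta = [t - eta * g_i for t, g_i in zip(theta, g)]

c = eta - L * eta ** 2 / 2        # coefficient from the descent inequality
bound = (f0 - 0.0) / (T * c)      # bound on the smallest squared norm
print(min(sq_norms) <= bound, sum(sq_norms) <= f0 / c)  # True True
```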


# Convergence of Function Values

We want to bound the suboptimality

$$
f(\theta_k)-f(\theta^*)
$$

for convex, $L$-smooth functions.

We use only **two facts**:

1. **Convexity**
   $$
   f(u) \ge f(v) + \nabla f(v)^\top(u-v)
   $$
   or equivalently
   $$
   f(v)-f(u) \le \nabla f(v)^\top (v-u).
   $$

2. **Smoothness**
   $$
   |\nabla f(x)-\nabla f(y)|\le L|x-y|.
   $$

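For an $L$-smooth function with minimizer $\theta^*$, the gradient norm is bounded by the suboptimality (the Baillon–Haddad theorem on cocoercivity gives a stronger statement for convex, $L$-smooth functions):

$$
|\nabla f(x)|^2 \le 2L\bigl(f(x)-f(\theta^*)\bigr).
$$

(This follows from the smoothness upper bound with $u=x$ and $v=x-\tfrac{1}{L}\nabla f(x)$, which gives $f(\theta^*)\le f(v)\le f(x)-\tfrac{1}{2L}|\nabla f(x)|^2$.)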

## Expand the squared distance to the minimizer

Consider the basic identity:

$$
|\theta_{k+1}-\theta^*|^2
=
|\theta_k - \eta\nabla f(\theta_k) - \theta^*|^2.
$$

Expand:

$$
|\theta_{k+1}-\theta^*|^2
=
|\theta_k-\theta^*|^2 - 2\eta\nabla f(\theta_k)^\top(\theta_k-\theta^*)
+ \eta^2|\nabla f(\theta_k)|^2.
$$

## Use convexity to replace the inner product

Convexity gives:

$$
f(\theta_k)-f(\theta^*)
\le
\nabla f(\theta_k)^\top(\theta_k-\theta^*).
$$

Insert this lower bound on the inner product into the previous expansion:

$$
|\theta_{k+1}-\theta^*|^2
\le
|\theta_k-\theta^*|^2 - 2\eta \bigl(f(\theta_k)-f(\theta^*)\bigr)
+ \eta^2|\nabla f(\theta_k)|^2.
$$

## Use smoothness to upper-bound the gradient norm

For an $L$-smooth function with minimizer $\theta^*$, the gradient norm is bounded by the suboptimality:

$$
|\nabla f(\theta_k)|^2
\le
2L\bigl(f(\theta_k)-f(\theta^*)\bigr).
$$

Plug in:

$$
|\theta_{k+1}-\theta^*|^2
\le
|\theta_k-\theta^*|^2
- 2\eta \bigl(f(\theta_k)-f(\theta^*)\bigr)
+ 2L\eta^2 \bigl(f(\theta_k)-f(\theta^*)\bigr).
$$

Factor the suboptimality, using $-2\eta + 2L\eta^2 = -2\eta(1-\eta L)$:

$$
|\theta_{k+1}-\theta^*|^2
\le
|\theta_k-\theta^*|^2
- 2\eta(1 - \eta L)\bigl(f(\theta_k) - f(\theta^*)\bigr).
$$

The factor $(1-\eta L)$ is positive only when $\eta<1/L$, so this argument needs the stronger step-size condition $0<\eta<1/L$ rather than $0<\eta<2/L$.

## Rearrange to isolate the suboptimality

$$
f(\theta_k)-f(\theta^*)
\le
\frac{|\theta_k-\theta^*|^2 - |\theta_{k+1}-\theta^*|^2}
{2\eta(1-\eta L)}.
$$

This is the key inequality.

## Sum over iterations

Sum from $k=0$ to $T-1$. The left side is a sum of nonnegative terms; the right side telescopes, and dropping the nonpositive term $-|\theta_T-\theta^*|^2$ gives

$$
\sum_{k=0}^{T-1} \bigl(f(\theta_k)-f(\theta^*)\bigr)
\le
\frac{|\theta_0-\theta^*|^2}{2\eta(1-\eta L)}.
$$

Since the loss decreases monotonically under gradient descent, $f(\theta_T)\le f(\theta_k)$ for every $k<T$, so $f(\theta_T)-f(\theta^*)$ is at most the average of the terms on the left:

$$
f(\theta_T)-f(\theta^*)
\le
\frac{|\theta_0-\theta^*|^2}
{2T\eta(1-\eta L)}.
$$

Up to constants, this is:

$$
f(\theta_T)-f(\theta^*)
= O\left(\frac{1}{T}\right).
$$

# Final Result

For convex, $L$-smooth $f$ and $0<\eta<1/L$:

$$
f(\theta_T)-f(\theta^*)
\le
\frac{|\theta_0-\theta^*|^2}
{2T\eta(1-\eta L)}.
$$

This gives the sublinear rate:

$$
f(\theta_T)-f(\theta^*) = O(1/T).
$$

The $O(1/T)$ rate is known to extend to $0<\eta<2/L$, but that requires a different argument from the one above.

This is the standard convergence guarantee for gradient descent in the convex smooth setting.
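As a numerical sanity check of the $O(1/T)$ behavior, the sketch below runs gradient descent on a convex quadratic and compares the suboptimality with the bound $\frac{|\theta_0-\theta^*|^2}{2T\eta(1-\eta L)}$. This particular form of the bound and the restriction $\eta<1/L$ are assumptions of the sketch; it comes from factoring $-2\eta+2L\eta^2$ as $-2\eta(1-\eta L)$ in the expansion of the squared distance. The test function and step size are hypothetical.

```python
# Hypothetical convex test function with minimizer theta* = (0, 0), f(theta*) = 0, and L = 10.
f = lambda th: 0.5 * (th[0] ** 2 + 10.0 * th[1] ** 2)
grad = lambda th: [th[0], 10.0 * th[1]]
L = 10.0
eta = 0.5 / L                      # assumed step size, inside (0, 1/L)

theta0 = [1.0, 1.0]
dist0_sq = sum(t ** 2 for t in theta0)     # |theta_0 - theta*|^2 with theta* = 0

theta, holds = theta0, True
for T in range(1, 201):
    theta = [t - eta * g for t, g in zip(theta, grad(theta))]
    gap = f(theta) - 0.0                                 # f(theta_T) - f(theta*)
    bound = dist0_sq / (2 * T * eta * (1 - eta * L))
    holds = holds and gap <= bound
print(holds)  # True
```

On this well-conditioned quadratic the iterates converge much faster than $1/T$; the bound is a worst-case guarantee over all convex, $L$-smooth functions.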


# Final Summary

Under convexity and $L$-smoothness:

* **Monotonic decrease** for $0<\eta<2/L$:
  $$
  f(\theta_{k+1}) \le f(\theta_k).
  $$

* **Gradient norms go to zero** for $0<\eta<2/L$, with the smallest norm among the first $k$ iterates decaying at rate $O(1/\sqrt{k})$.

* **Function values converge** for $0<\eta<1/L$ at rate
  $$
  f(\theta_k) - f(\theta^*) \le \frac{|\theta_0-\theta^*|^2}{2\eta(1-\eta L)k}
  $$
  i.e., $O(1/k)$.

# Remarks for Neural Networks

* Neural network losses are **nonconvex**, so global convergence guarantees from convex theory do **not** apply.
* However, if the loss is **$L$-smooth**, then any learning rate $0<\eta<2/L$ ensures
  $$f(\theta_{k+1}) \le f(\theta_k),$$
  because the monotone-decrease argument uses only smoothness. Gradient descent therefore remains a **descent method** even without convexity.
* Modern deep networks are **overparameterized**, and their local regions of the loss often behave *approximately convex*; wide minima and locally positive-semidefinite Hessians appear frequently.
* In high-dimensional loss landscapes, **saddle points** are thought to be far more common than bad local minima, and in practice many local minima reach similar loss values.
* Smoothness-based tools (descent lemma, Lipschitz gradient bounds) still apply in the nonconvex setting and explain why training remains stable.
* Convex results give a **baseline theory**; neural networks violate convexity, but many convex inequalities still hold locally or approximately, which is enough for gradient-based training to work well in practice.
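The claim that smoothness alone gives monotone descent can be illustrated on a one-dimensional nonconvex function. The sketch assumes $f(x) = \cos(x)$, which is $L$-smooth with $L = 1$ because $|f''(x)| \le 1$; it is a toy stand-in, not a neural network loss, and the starting point and step size are arbitrary choices.

```python
import math

# Hypothetical nonconvex test function f(x) = cos(x), L-smooth with L = 1.
f = math.cos
grad = lambda x: -math.sin(x)
L = 1.0
eta = 1.5 / L            # inside (0, 2/L)

x, values = 0.3, [f(0.3)]
for _ in range(30):
    x = x - eta * grad(x)
    values.append(f(x))

# Allow a tiny tolerance for floating-point rounding once the iterates have converged.
monotone = all(b <= a + 1e-12 for a, b in zip(values, values[1:]))
print(monotone, round(x, 6))
```

The iterates settle at a stationary point of $\cos$, which is all the smoothness argument promises; it says nothing about reaching a global minimum in general.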


# Numerical Example (2–2–1 Network, Explicit Derivatives)

Input and label:

$$
x=\begin{bmatrix}1 \\ 2\end{bmatrix},
\qquad
y=1.
$$

Parameters:

$$
W_1=\begin{bmatrix}0.1 & -0.2 \\ 0.4 & 0.5\end{bmatrix},
\qquad
b_1=\begin{bmatrix}0.1 \\ -0.1\end{bmatrix}
$$

$$
W_2=\begin{bmatrix}0.3 & -0.4\end{bmatrix},
\qquad
b_2=0.2.
$$

# Forward Pass

## Hidden pre-activation

$$
z_1=W_1x+b_1
$$

Compute:

$$
W_1x
=
\begin{bmatrix}0.1 & -0.2 \\ 
0.4 & 0.5\end{bmatrix}
\begin{bmatrix}1 \\ 2\end{bmatrix}
=
\begin{bmatrix}0.1(1) + (-0.2)(2) \\ 
0.4(1) + 0.5(2)\end{bmatrix}
=
\begin{bmatrix}0.1 - 0.4 \\ 
0.4 + 1.0\end{bmatrix}
=
\begin{bmatrix} -0.3 \\ 
1.4\end{bmatrix}
$$

Adding $b_1$:

$$
z_1
=
\begin{bmatrix}-0.3 \\ 1.4\end{bmatrix}
+
\begin{bmatrix}0.1 \\ -0.1\end{bmatrix}
=
\begin{bmatrix}-0.2 \\ 1.3\end{bmatrix}
$$

## Hidden activation (ReLU)

$$
\text{ReLU}(u)=\begin{cases}u & u>0\\ 0 & u\le 0\end{cases}
$$

Thus:

$$
a_1=\text{ReLU}(z_1)
=\begin{bmatrix}0 \\ 1.3\end{bmatrix}
$$

## Output pre-activation

$$
z_2=W_2 a_1 + b_2
$$

Compute:

$$
W_2 a_1
=
\begin{bmatrix}0.3 & -0.4\end{bmatrix}
\begin{bmatrix}0\\
1.3\end{bmatrix}
=
0.3(0)+(-0.4)(1.3) = -0.52
$$

Thus:

$$
z_2 = -0.52 + 0.2 = -0.32
$$

## Output activation (sigmoid)

$$
\sigma(z)=\frac{1}{1+e^{-z}}
$$

Thus:

$$
a_2 = \sigma(-0.32)
= \frac{1}{1+e^{0.32}}
\approx 0.4207
$$



# Output-Layer Delta (Explicit)

Binary cross-entropy:

$$
L = -\left[y\ln a_2 + (1-y)\ln(1-a_2)\right]
$$

Derivative with respect to the output activation:

$$
\frac{\partial L}{\partial a_2} = -\frac{y}{a_2} + \frac{1-y}{1-a_2}
$$

With $y=1$ and $a_2=\sigma(-0.32)\approx 0.4207$ (values below are rounded to four decimals, with products computed from unrounded intermediates):

$$
\frac{\partial L}{\partial a_2} = -\frac{1}{0.4207} \approx -2.377
$$

Sigmoid derivative:

$$
\frac{\partial a_2}{\partial z_2} = \sigma'(z_2)=a_2(1-a_2)=0.4207(1-0.4207)\approx 0.2437
$$

Output delta:

$$
\delta_2
=
\frac{\partial L}{\partial a_2}
\frac{\partial a_2}{\partial z_2}
=
(-2.377)(0.2437)
\approx -0.5793
$$

This agrees with the closed form for a sigmoid output with cross-entropy loss, where the factor $a_2(1-a_2)$ cancels:

$$
\delta_2 = \frac{\partial L}{\partial z_2} = a_2 - y = 0.4207 - 1 = -0.5793
$$



# Output-Layer Gradients

$$
\frac{\partial L}{\partial W_2}
=\delta_2 a_1^\top
$$

Compute, with $\delta_2 \approx -0.5793$:

$$
\delta_2 a_1^\top
= -0.5793
\begin{bmatrix}0 & 1.3\end{bmatrix}
\approx
\begin{bmatrix}0 & -0.7531\end{bmatrix}
$$

Bias:

$$
\frac{\partial L}{\partial b_2}
=\delta_2\approx-0.5793
$$



# Hidden-Layer Delta

Backprop term:

$$
W_2^\top\delta_2
=
\begin{bmatrix}0.3 \\ -0.4\end{bmatrix}(-0.5793)
=
\begin{bmatrix}
0.3(-0.5793) \\ -0.4(-0.5793)
\end{bmatrix}
\approx
\begin{bmatrix}
-0.1738 \\ 0.2317
\end{bmatrix}
$$

ReLU derivative at $z_1 = \begin{bmatrix}-0.2 & 1.3\end{bmatrix}^\top$:

$$
\text{ReLU}'(z_1)
=
\begin{bmatrix}
0 \\ 1
\end{bmatrix}
$$

Thus:

$$
\delta_1
=
(W_2^\top \delta_2)\odot \text{ReLU}'(z_1)
=
\begin{bmatrix}-0.1738 \\ 0.2317\end{bmatrix}
\odot
\begin{bmatrix}0 \\ 1\end{bmatrix}
=
\begin{bmatrix}0 \\ 0.2317\end{bmatrix}
$$



# Hidden-Layer Gradients

Compute:

$$
\frac{\partial L}{\partial W_1}
=
\delta_1 x^\top
=
\begin{bmatrix}0 \\ 0.2317\end{bmatrix}
\begin{bmatrix}1 & 2\end{bmatrix}
\approx
\begin{bmatrix}
0 & 0 \\
0.2317 & 0.4635
\end{bmatrix}
$$

Bias gradient:

$$
\frac{\partial L}{\partial b_1}
=\delta_1
=\begin{bmatrix}0 \\ 0.2317\end{bmatrix}
$$



# Gradient Summary

Output layer:

$$
\frac{\partial L}{\partial W_2}
\approx
\begin{bmatrix}
0 & -0.7531
\end{bmatrix},
\qquad
\frac{\partial L}{\partial b_2}\approx-0.5793
$$

Hidden layer:

$$
\frac{\partial L}{\partial W_1}
\approx
\begin{bmatrix}
0 & 0 \\
0.2317 & 0.4635
\end{bmatrix},
\qquad
\frac{\partial L}{\partial b_1}
\approx
\begin{bmatrix}0 \\ 0.2317\end{bmatrix}
$$
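The worked example can be recomputed numerically as a cross-check. The sketch below uses the same inputs and parameters, takes the output delta as $\delta_2 = a_2 - y$ (the closed form for a sigmoid output with binary cross-entropy), and compares one entry of $\partial L/\partial W_1$ with a central finite difference. The function names are hypothetical.

```python
import math

def forward(W1, b1, W2, b2, x):
    """Forward pass of the 2-2-1 network: ReLU hidden layer, sigmoid output."""
    z1 = [sum(w * x_j for w, x_j in zip(row, x)) + b for row, b in zip(W1, b1)]
    a1 = [max(0.0, z) for z in z1]
    z2 = sum(w * a for w, a in zip(W2, a1)) + b2
    a2 = 1.0 / (1.0 + math.exp(-z2))
    return z1, a1, z2, a2

def bce(a2, y):
    """Binary cross-entropy for a single example."""
    return -(y * math.log(a2) + (1 - y) * math.log(1 - a2))

x, y = [1.0, 2.0], 1.0
W1, b1 = [[0.1, -0.2], [0.4, 0.5]], [0.1, -0.1]
W2, b2 = [0.3, -0.4], 0.2

z1, a1, z2, a2 = forward(W1, b1, W2, b2, x)
delta2 = a2 - y                                   # sigmoid output with cross-entropy
grad_W2 = [delta2 * a for a in a1]                # delta_2 a_1^T
delta1 = [w * delta2 * (1.0 if z > 0 else 0.0) for w, z in zip(W2, z1)]
grad_W1 = [[d * x_j for x_j in x] for d in delta1]  # delta_1 x^T

# Central finite difference for dL/dW1[1][0] as an independent check.
h = 1e-6
W1[1][0] += h
up = bce(forward(W1, b1, W2, b2, x)[3], y)
W1[1][0] -= 2 * h
down = bce(forward(W1, b1, W2, b2, x)[3], y)
W1[1][0] += h                                     # restore the original weight
print(round(a2, 4), round(delta2, 4), [round(g, 4) for g in grad_W1[1]])
print(round((up - down) / (2 * h), 4))
```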
