Notes based on lectures and youtube video: https://www.youtube.com/watch?v=CqOfi41LfDw&ab_channel=StatQuestwithJoshStarmer

# Universal approximation theorem
The great efficacy of neural networks can largely be attributed to the universal approximation theorem, which states that a suitably chosen network $f$ can satisfy:

$$
\vert F(\boldsymbol{x})-f(\boldsymbol{x};\boldsymbol{\Theta})\vert < \epsilon \hspace{0.1cm} \forall \boldsymbol{x}\in[0,1]^d.
$$

Here $F(\boldsymbol{x})$ is a continuous, deterministic function defined on the unit cube in $d$ dimensions, $F : [0,1]^d \rightarrow \mathbb{R}$, which we aim to approximate to within any given error tolerance $\epsilon > 0$. $f(\boldsymbol{x}; \boldsymbol{\Theta})$ is a neural network with one hidden layer of $m$ neurons and parameters $\boldsymbol{\Theta} = (\boldsymbol{W},\boldsymbol{b})$, with weights $\boldsymbol{W}\in\mathbb{R}^{m\times d}$ and biases $\boldsymbol{b}\in \mathbb{R}^{m}$.

The neural network $f$ takes the input $\boldsymbol{x}$ and computes for each neuron $i$ the activation value $z_i$ as:
$$z_i = \boldsymbol{w_i}\cdot \boldsymbol{x} + b_i$$
where $\boldsymbol{w_i}$ and $b_i$ are the weight vector and bias of neuron $i$. Each $z_i$ is a scalar, which is then passed through a continuous sigmoidal function $\sigma(z)$, also known as an activation function. This function has the limiting behaviour:
$$
\sigma(z) \rightarrow \left\{\begin{array}{cl} 1 & \text{as } z\rightarrow \infty\\ 0 & \text{as } z \rightarrow -\infty \end{array}\right.
$$
One example of such a function is the standard logistic function:
$$ \sigma(z) = \frac{1}{1+\mathrm e^{-z}} $$
The activation function introduces non-linearity, which makes it possible to approximate highly complex functions.
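As a small illustration of the limiting behaviour described above, here is a minimal sketch of the logistic function in plain Python. The function name `sigmoid` and the sign-based branching (used only to avoid floating-point overflow) are choices made for this sketch, not something prescribed by the lectures.

```python
import math


def sigmoid(z: float) -> float:
    """Standard logistic function 1 / (1 + exp(-z))."""
    # Branch on the sign of z so that exp() is only ever called with a
    # non-positive argument; this avoids OverflowError for very negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# The output approaches 0 for large negative z and 1 for large positive z.
for z in (-20.0, -2.0, 0.0, 2.0, 20.0):
    print(f"sigma({z:+.0f}) = {sigmoid(z):.6f}")
```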

By summing the outputs of all the neurons, each scaled by an output weight, we approximate $F(\boldsymbol{x})$. Specifically, if we denote each neuron's activated output as $\sigma(z_i)$, where $z_i = \boldsymbol{w_i} \cdot \boldsymbol{x} + b_i$, then the final output of the neural network can be expressed as:
$$
f(\boldsymbol{x}; \boldsymbol{\Theta}) = \sum_{i=1}^{m} \alpha_i\,\sigma(z_i)
$$
Here $\alpha_i \in \mathbb{R}$ is the output weight of neuron $i$, counted among the network's parameters $\boldsymbol{\Theta}$. The weights $\boldsymbol{w_i}$ and biases $b_i$ control where and how steeply each neuron switches on, while $\alpha_i$ sets the size and sign of its contribution. Through careful adjustment of these parameters, we can control each neuron's impact on the overall approximation, allowing $f(\boldsymbol{x}; \boldsymbol{\Theta})$ to closely approximate $F(\boldsymbol{x})$ within the given error tolerance $\epsilon$.
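To make the idea of combining neurons concrete, the sketch below builds a one-hidden-layer network with $d = 1$ and $m = 2$ and picks the parameters by hand so that the output approximates a "bump" that is close to 1 on $[0.3, 0.7]$ and close to 0 elsewhere. The output coefficients `alpha`, the steepness `k`, and the function names are assumptions of this sketch; setting every entry of `alpha` to 1 gives a plain unweighted sum of the neuron outputs.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for very negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def network(x, W, b, alpha):
    """One hidden layer: sum_i alpha[i] * sigmoid(W[i] . x + b[i])."""
    total = 0.0
    for w_i, b_i, a_i in zip(W, b, alpha):
        z_i = sum(w * x_k for w, x_k in zip(w_i, x)) + b_i
        total += a_i * sigmoid(z_i)
    return total


# Hand-picked parameters (illustrative only): the first neuron switches on
# near x = 0.3, the second near x = 0.7 and is subtracted again.
k = 100.0                  # steepness of each sigmoid step
W = [[k], [k]]             # m = 2 neurons, d = 1 input
b = [-0.3 * k, -0.7 * k]
alpha = [1.0, -1.0]

for x in (0.1, 0.5, 0.9):
    print(x, round(network([x], W, b, alpha), 6))
```

Adding more such pairs of neurons gives more bumps, which is one intuitive way to see why a wide enough hidden layer can follow a continuous target function closely.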

For networks with a single hidden layer, the universal approximation theorem guarantees that with sufficiently many neurons and suitable weights $\boldsymbol{W}$ and biases $\boldsymbol{b}$, we can approximate any continuous function on the unit cube to arbitrary accuracy. The theorem only states that such parameters exist; it does not tell us how to find them. For multi-layer networks, which often enhance approximation power, backpropagation enables efficient tuning of parameters across layers through gradient descent methods.


# Back propagation
Back propagation is a process in which we improve a neural network by adjusting its internal parameters, using the gradients of the cost/loss function with respect to these parameters. We examine the initial predictions of the network, and then work backwards from the output, adjusting each layer of neurons using the chain rule of calculus.

### Forward pass
First we do a forward pass, where an input goes through the network and produces a predicted output. Again we're interested in the activation values, but now across multiple layers, where the values in each layer are determined by the immediately preceding layer. The expression for each neuron $j$ in layer $l$ is then:

$$
z_j^l = \sum_{i=1}^{M_{l-1}}w_{ij}^la_i^{l-1}+b_j^l,
$$

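Here $a_i^{l-1} = \sigma(z_i^{l-1})$ is the output of neuron $i$ in the previous layer, $w_{ij}^l$ is the weight connecting that neuron to neuron $j$ in layer $l$, and $M_{l-1}$ is the number of neurons in layer $l-1$. The final layer gives the predicted outcome $\hat{y}$.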
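The forward pass can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: the nested-list layout `weights[l][i][j]` (mirroring the index order of $w_{ij}^l$), the function names, the example numbers, and the use of the logistic function in every layer, including the output layer, are all choices made for this sketch.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for very negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def forward(x, weights, biases):
    """Return the activations of every layer, starting with the input.

    weights[l][i][j] is the weight from neuron i in the previous layer
    to neuron j in the current layer; biases[l][j] is the bias of neuron j.
    """
    activations = [list(x)]
    for W, b in zip(weights, biases):
        a_prev = activations[-1]
        # z_j = sum_i w_ij * a_i + b_j for every neuron j in this layer
        z = [
            sum(W[i][j] * a_prev[i] for i in range(len(a_prev))) + b[j]
            for j in range(len(b))
        ]
        activations.append([sigmoid(z_j) for z_j in z])
    return activations


# Example: 2 inputs -> 3 hidden neurons -> 1 output (arbitrary numbers).
weights = [
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    [[0.7], [0.8], [0.9]],
]
biases = [[0.0, 0.0, 0.0], [0.0]]
y_hat = forward([0.3, 0.7], weights, biases)[-1]
print(y_hat)
```

Keeping the activations of every layer, rather than only the final output, is deliberate: the backward pass needs them when applying the chain rule.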

### Error calculation
We then use a cost/loss function such as MSE or cross-entropy loss to evaluate how far our prediction is from the target. Our goal is to minimize this function.

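The MSE-type cost function, written here as a sum of squared errors with a factor $\frac{1}{2}$ instead of the mean (this only rescales the cost), is defined as:

$$
{\cal C}(\boldsymbol{\Theta})  =  \frac{1}{2}\sum_{i=1}^n\left(y_i - \hat{y}_i\right)^2,
$$

where $y_i$ are the $n$ target values and $\hat{y}_i$ are the corresponding predictions of the network.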
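The cost above translates directly into code. The following is a minimal sketch; the function name `squared_error_cost` and the plain-list inputs are assumptions made for illustration.

```python
def squared_error_cost(targets, predictions):
    """Half the sum of squared differences between targets and predictions."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    return 0.5 * sum((y - y_pred) ** 2 for y, y_pred in zip(targets, predictions))


# A perfect prediction gives zero cost; errors are penalised quadratically.
print(squared_error_cost([1.0, 0.0], [1.0, 0.0]))
print(squared_error_cost([1.0, 0.0], [0.0, 0.0]))
```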



### Gradient calculation
To adjust the weights and biases, we compute the gradient of the cost function with respect to each weight and bias. Some parameters contribute more to the error than others, which means that they need greater adjustments.
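As a worked illustration of the chain rule, consider the output layer $L$ for a single training example. The following assumes the squared-error cost from the previous section, taken over the output neurons $j$, and a logistic activation in the output layer, so that $a_j^L = \sigma(z_j^L)$; both are assumptions of this sketch rather than the only possible choices. Since $z_j^L$ depends on $w_{ij}^L$ only through the term $w_{ij}^L a_i^{L-1}$, we get:

$$
\frac{\partial {\cal C}}{\partial w_{ij}^L} = \frac{\partial {\cal C}}{\partial a_j^L}\,\frac{\partial a_j^L}{\partial z_j^L}\,\frac{\partial z_j^L}{\partial w_{ij}^L} = \left(a_j^L - y_j\right)\sigma'(z_j^L)\,a_i^{L-1}
$$

$$
\frac{\partial {\cal C}}{\partial b_j^L} = \frac{\partial {\cal C}}{\partial a_j^L}\,\frac{\partial a_j^L}{\partial z_j^L}\,\frac{\partial z_j^L}{\partial b_j^L} = \left(a_j^L - y_j\right)\sigma'(z_j^L)
$$

For the logistic function the derivative has the convenient form $\sigma'(z) = \sigma(z)\left(1-\sigma(z)\right)$, so every quantity on the right-hand side is already available from the forward pass.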

### Gradient adjustment