## General approach

A framework to design learning algorithms:

$ \DeclareMathOperator*{\argmin}{arg\,min} $
$ \underset{\theta}{\argmin} \frac{1}{T} \sum_{i=1}^{T} l(f(x^{(i)};\theta), y^{(i)}) + \lambda \Omega(\theta)$

* $l(f(x^{(i)};\theta), y^{(i)})$ is a loss function, evaluated on the $i$-th of the $T$ training examples
* $\Omega(\theta)$ is a regularizer (penalizes certain values of $\theta$)

Learning is cast as an optimization problem. We are searching for the $\theta$ that minimizes the expression.

Remember that $\theta$ is the variable of this expression; $x^{(i)}$ and $y^{(i)}$ are fixed values from the data set.

The expression has two parts. The first is the average of a loss function that compares the output of the network with the actual label. The second is a regularizer that penalizes certain values of $\theta$.

The hyper-parameter $\lambda$ controls the balance between minimizing the average loss and the regularizer function.
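
To make the two parts of the expression concrete, here is a minimal sketch in NumPy. The linear model, the squared-error loss and the L2 regularizer are assumptions chosen only for illustration; the notes above do not fix any of them.

```python
import numpy as np

def squared_error(prediction, target):
    # Hypothetical per-example loss l(f(x; theta), y); any loss could be used here.
    return 0.5 * (prediction - target) ** 2

def l2_regularizer(theta):
    # One common choice for Omega(theta): the squared L2 norm of the parameters.
    return np.sum(theta ** 2)

def objective(theta, X, y, lam):
    # f(x; theta) is a plain linear model here, chosen only for illustration.
    predictions = X @ theta
    average_loss = np.mean(squared_error(predictions, y))
    return average_loss + lam * l2_regularizer(theta)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
theta = np.array([0.5, 0.0])

loss_only = objective(theta, X, y, lam=0.0)     # average loss, no penalty
loss_with_reg = objective(theta, X, y, lam=0.1)  # average loss plus weighted penalty
```

With `lam=0.0` only the average loss counts; increasing `lam` makes large parameter values more expensive.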


## Gradient Descent

By casting the learning problem as an optimization problem, we can apply an algorithm like Gradient Descent:

* Initialize $\theta$, where $ \theta = \{ W^{(1)}, b^{(1)}, ..., W^{(L+1)}, b^{(L+1)} \}$
* For N iterations
  * For each training example $(x^{(i)}, y^{(i)})$
    * $\Delta = -\nabla_\theta l(f(x^{(i)};\theta), y^{(i)}) - \lambda \nabla_\theta \Omega(\theta)$
    * $ \theta = \theta + \alpha \Delta $
 
N is a hyper-parameter. $\alpha$ is the step size or learning rate, which is also a hyper-parameter.

Epoch = iteration over **all** training examples
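
The loop above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: a linear model with squared-error loss stands in for the network, $\Omega(\theta)$ is taken to be the squared L2 norm, and the names `loss_gradient`, `regularizer_gradient` and `train` are made up for this example.

```python
import numpy as np

def loss_gradient(theta, x, y):
    # Gradient of the squared error 0.5 * (x . theta - y)^2 with respect to theta.
    return (x @ theta - y) * x

def regularizer_gradient(theta):
    # Gradient of Omega(theta) = sum(theta^2).
    return 2.0 * theta

def train(X, y, n_epochs=200, alpha=0.05, lam=0.0, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.normal(scale=0.01, size=X.shape[1])  # initialize theta
    for _ in range(n_epochs):                        # N iterations (epochs)
        for x_i, y_i in zip(X, y):                   # each training example
            delta = -loss_gradient(theta, x_i, y_i) - lam * regularizer_gradient(theta)
            theta = theta + alpha * delta            # step of size alpha along delta
    return theta

# Toy data generated from known parameters, so the result can be checked.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = X @ np.array([2.0, -1.0])
theta = train(X, y)
```

On this noise-free toy data and with `lam=0.0`, `theta` should end up close to the parameters that generated `y`.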

To apply this algorithm to neural network training we need:

 * initialization function
 * the loss function $l(f(x^{(i)};\theta), y^{(i)})$
 * a way to compute the loss gradients $ \nabla_\theta l(f(x^{(i)};\theta), y^{(i)}) $
 * a way to compute the regularizer gradients $ \nabla_\theta \Omega(\theta) $ 
 
There are several variations to this algorithm:
 * Gradient Descent: Average the gradients of **all** training examples and then do a single update on $\theta$
 * Stochastic Gradient Descent: Calculate the gradient for a single training example and update $\theta$ immediately, once per example
 * Mini-batch Gradient Descent: Average the gradients of a batch of training examples and update $\theta$ with this value
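
The three variations differ only in how many examples are averaged before each update, so one sketch with a `batch_size` parameter can cover all of them. As before, the linear model and squared-error loss are assumptions for illustration, the regularizer is left out to keep the example short, and shuffling the examples every epoch is a common practice rather than something these notes prescribe.

```python
import numpy as np

def average_gradient(theta, X_batch, y_batch):
    # Mean gradient of the squared error 0.5 * (x . theta - y)^2 over the batch.
    residuals = X_batch @ theta - y_batch
    return X_batch.T @ residuals / len(y_batch)

def train(X, y, batch_size, n_epochs=500, alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        order = rng.permutation(len(y))  # reshuffle the examples every epoch
        for start in range(0, len(y), batch_size):
            batch = order[start:start + batch_size]
            theta = theta - alpha * average_gradient(theta, X[batch], y[batch])
    return theta

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, -1.0])

theta_full = train(X, y, batch_size=len(y))  # Gradient Descent: one update per epoch
theta_sgd = train(X, y, batch_size=1)        # Stochastic Gradient Descent: one update per example
theta_mini = train(X, y, batch_size=2)       # Mini-batch Gradient Descent: one update per batch
```

Smaller batches give more updates per epoch at the cost of noisier gradient estimates; on this tiny noise-free data set all three settings should reach roughly the same parameters.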

## Loss function for classification

For classification the neural network calculates:

$ f(x) = [f(x)_{c_1},...,f(x)_{c_n}]$

where $c_i$ represents a class and $f(x)_c = p(y=c \,|\ x)$. 

All $f(x)_c$ sum up to 1.

We want to maximize the probability of $y^{(i)}$ given $x^{(i)}$.

To do this we minimize the **negative log-likelihood** or **negative log-probability**:

$ l(f(x),y) = - \sum_c 1_{(y=c)} \log(f(x)_c) = -\log(f(x)_y)$

So the loss for a sample is the negative log of the y-th element of the output vector (this assumes y is 0-based).

The log improves numerical stability and simplifies the math. Maximizing a value $z$ is the same as maximizing $\log(z)$ because the logarithm is a monotonically increasing function. We negate the log because we want to minimize the loss: minimizing $-z$ is the same as maximizing $z$.

This loss function is also known as **cross-entropy**.
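
As a small illustration, the loss can be computed directly from an output vector. The probabilities below are made up, and the function names are hypothetical; the second function only shows that the sum with the indicator $1_{(y=c)}$ gives the same value as picking the y-th element.

```python
import numpy as np

def negative_log_likelihood(f_x, y):
    # f_x holds one probability per class; y is a 0-based class index.
    return -np.log(f_x[y])

def negative_log_likelihood_sum(f_x, y):
    # Same loss written as the sum over classes with the indicator 1_(y=c).
    indicator = np.zeros_like(f_x)
    indicator[y] = 1.0
    return -np.sum(indicator * np.log(f_x))

f_x = np.array([0.7, 0.2, 0.1])  # made-up network output, sums to 1

loss_likely = negative_log_likelihood(f_x, 0)    # true class got high probability
loss_unlikely = negative_log_likelihood(f_x, 2)  # true class got low probability
```

The loss is small when the network assigns a high probability to the true class and grows as that probability approaches 0.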
