# Gradient-Based Learning

Designing and training a neural network is not much different from training any other machine learning model with gradient descent.
The largest difference between linear models and neural networks is that the nonlinearity of a neural network causes most interesting loss
functions to become nonconvex. As a result, neural networks are usually trained with iterative, gradient-based optimizers that merely drive the cost
function to a very low value, rather than with the linear equation solvers used to train linear regression models or the convex optimization algorithms
with global convergence guarantees, which converge starting from any initial parameters.

For the moment, it suffices to understand that the training algorithm is almost always based on using the gradient to descend the cost
function in one way or another. The specific algorithms are improvements and refinements on the idea of gradient descent.
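
To make the basic idea concrete, here is a minimal sketch of plain gradient descent. The function names, the toy cost $J(\theta) = (\theta - 3)^2$, the learning rate and the step count are all assumptions chosen for illustration. The toy cost is convex, so the iteration reaches the minimum from any starting point, which is not guaranteed for the nonconvex costs of neural networks.

```python
def gradient_descent(grad, theta0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from theta0."""
    theta = theta0
    for _ in range(steps):
        theta = theta - learning_rate * grad(theta)
    return theta


# Toy cost J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
def toy_grad(theta):
    return 2.0 * (theta - 3.0)


theta_hat = gradient_descent(toy_grad, theta0=0.0)
print(theta_hat)  # very close to the minimizer 3.0
```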

As with other machine learning models, to apply gradient-based learning we must choose a cost function, and we must choose how to represent the output of
the model. We now revisit these design considerations with special emphasis on neural networks.

## Cost Functions

An important aspect of the design of a deep neural network is the choice of the cost function.

In most cases, our parametric model defines a distribution $p(y|x;\theta)$ and we simply use the principle
of maximum likelihood. This means we use the cross-entropy between the training data $x,y \sim \hat{p}_{data}$ and 
the model's predictions $p_{model} (y|x)$ as the cost function.

This cost function is given by
$$
 J(\theta) = - \mathbb{E}_{x,y \sim \hat{p}_{data}} (\log p_{model} (y|x))
$$

where $\mathbb{E}_{x,y\sim\hat{p}_{data}}$ is the expected value operator with respect to the empirical distribution $\hat{p}_{data}$.
Note that if $p_{model} (y|x) = \mathcal{N}(y; f(x;\theta), I)$ then we can recover the mean squared error cost
$$
 J(\theta) = \frac{1}{2}\mathbb{E}_{x,y \sim \hat{p}_{data}} \Vert y - f(x;\theta) \Vert^2 + C
$$

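where $C$ collects the terms that do not depend on $\theta$. In other words, maximum likelihood recovers the mean squared error up to a scaling
factor of $\frac{1}{2}$ and this additive constant. The constant is based on the variance of the Gaussian distribution, which is fixed here rather than learned.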

The equivalence between maximum likelihood estimation with a Gaussian output distribution and minimization of mean squared error is usually shown
for a linear model, but in fact it holds regardless of the $f(x;\theta)$ used to predict the mean of the Gaussian.
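
The equivalence can be checked numerically. The sketch below uses a hypothetical nonlinear predictor `f`, made-up data and arbitrary parameters (none of them from the text), and compares the mean negative log-likelihood under $\mathcal{N}(y; f(x;\theta), I)$ with half the mean squared error. The two should differ only by a constant that does not depend on $\theta$.

```python
import math


def gaussian_nll(y, mean):
    """Negative log-density of N(y; mean, I) for vectors given as lists."""
    d = len(y)
    sq_err = sum((yi - mi) ** 2 for yi, mi in zip(y, mean))
    return 0.5 * sq_err + 0.5 * d * math.log(2.0 * math.pi)


def f(x, theta):
    """Hypothetical nonlinear predictor of the Gaussian mean (an assumption)."""
    w, b = theta
    return [math.tanh(w * x) + b]


data = [(0.0, [0.1]), (1.0, [0.9]), (2.0, [1.2])]  # made-up (x, y) pairs
theta = (0.5, 0.3)                                  # arbitrary parameters

nll = sum(gaussian_nll(y, f(x, theta)) for x, y in data) / len(data)
half_mse = sum(0.5 * (y[0] - f(x, theta)[0]) ** 2 for x, y in data) / len(data)
C = 0.5 * math.log(2.0 * math.pi)  # constant for a one-dimensional output

print(nll - half_mse, C)  # the two numbers agree up to rounding
```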

An advantage of deriving the cost function from maximum likelihood is that specifying a model $p(y|x)$ automatically determines a cost function
$-\log p(y|x)$. One unusual property of the cross-entropy cost used to perform maximum likelihood estimation is that it usually does not have a
minimum value when applied to the models commonly used in practice. For discrete output variables, most models are parametrized in such a way that
they cannot represent a probability of zero or one, but can come arbitrarily close to doing so. Logistic regression is an example of such a model.
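
The missing minimum can be seen on a small example. The sketch below assumes a one-parameter logistic regression model $P(y=1|x) = \sigma(wx)$ and a made-up, linearly separable data set. Scaling up `w` keeps lowering the mean negative log-likelihood, which approaches zero without ever reaching it.

```python
import math


def softplus(a):
    """Numerically stable log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def mean_nll(w, data):
    """Mean of -log P(y | x) for P(y = 1 | x) = sigmoid(w * x), y in {0, 1}."""
    # -log sigmoid((2y - 1) * w * x) equals softplus((1 - 2y) * w * x).
    return sum(softplus((1 - 2 * y) * w * x) for x, y in data) / len(data)


# Made-up separable data: negative x belongs to class 0, positive x to class 1.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

losses = [mean_nll(w, data) for w in (1.0, 10.0, 100.0)]
print(losses)  # strictly decreasing, always positive
```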

Instead of learning a full probability distribution $p(y|x;\theta)$, we often want to learn just one conditional statistic
of y given x.

For example, we may have a predictor $f(x;\theta)$ that we wish to employ to predict the mean of y. If we use a sufficiently
powerful neural network, we can think of the neural network as being able to represent any function $f$ from a wide class of functions,
with this class being limited only by features such as continuity and boundedness rather than by having a specific parametric
form. From this point of view, we can regard the cost function as a **functional** rather than just a function. A functional is a mapping from
functions to real numbers. We can design our cost functional to have its minimum occur at some specific function we desire.

For example, we can design the cost functional to have its minimum lie on the function that maps x to the expected value of y
given x. Solving an optimization problem with respect to a function requires a mathematical tool called **calculus of variations**.

Our first result derived using calculus of variations is that solving the optimization problem
$$
f^* = \operatorname*{arg\,min}_{f} \mathbb{E}_{x,y \sim p_{data}} \Vert y - f(x) \Vert^2
$$

yields
$$
f^*(x) = \mathbb{E}_{y \sim p_{data}(y|x)} [y]
$$

so long as this function lies within the class we optimize over. In other words, if we could train on infinitely many samples from the
true data-generating distribution, minimizing the mean squared error cost function would give a function that predicts the mean of y for
each value of x.
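
An informal pointwise argument, offered as a sketch rather than the full variational derivation, shows why the mean appears. It assumes that the value $c = f(x)$ can be chosen freely and independently for each $x$, and that the conditional expectation exists. For a fixed $x$, setting the gradient with respect to $c$ to zero gives

$$
\nabla_c \, \mathbb{E}_{y \sim p_{data}(y|x)} \Vert y - c \Vert^2
= 2 \left( c - \mathbb{E}_{y \sim p_{data}(y|x)} [y] \right) = 0
\quad \Longrightarrow \quad
c = \mathbb{E}_{y \sim p_{data}(y|x)} [y]
$$

Because the objective is convex in $c$, this stationary point is the minimizer.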

A second result derived using calculus of variations is that
$$
f^* = \operatorname*{arg\,min}_{f} \mathbb{E}_{x,y \sim p_{data}} \Vert y - f(x) \Vert_1
$$

yields a function that predicts the median value of y for each x, as long as such a function may be described by the family of functions we
optimize over. This cost function is commonly called **mean absolute error**.
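
Both results can be illustrated for a single fixed x by brute force. The sketch below uses made-up samples of y and a grid search over constant predictions (an assumption made for simplicity, not a training procedure). The constant with the lowest mean squared error should match the sample mean, and the constant with the lowest mean absolute error should match the sample median.

```python
import statistics

ys = [1.0, 2.0, 2.0, 3.0, 10.0]  # made-up samples of y for one fixed x


def mean_squared_error(c):
    return sum((y - c) ** 2 for y in ys) / len(ys)


def mean_absolute_error(c):
    return sum(abs(y - c) for y in ys) / len(ys)


# Brute-force search over candidate constants on a grid with step 0.01.
grid = [i / 100.0 for i in range(0, 1001)]
best_mse = min(grid, key=mean_squared_error)
best_mae = min(grid, key=mean_absolute_error)

print(best_mse, statistics.mean(ys))    # grid minimizer vs. sample mean
print(best_mae, statistics.median(ys))  # grid minimizer vs. sample median
```

Note how the outlier at 10.0 pulls the mean upward while leaving the median unchanged.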

Unfortunately, mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization: some output
units that saturate produce very small gradients when combined with these cost functions. This is one reason that the cross-entropy cost function
is more popular than mean squared error or mean absolute error, even when it is not necessary to estimate an entire distribution $p(y|x)$.

## Output Units

The choice of cost function is tightly coupled with the choice of output unit. The choice of how to represent the output then
determines the form of the cross-entropy function.

We will suppose that the feedforward network provides a set of hidden features defined by $h=f(x;\theta)$. The role of the output
layer is then to provide some additional transformation from the features to complete the task that the network must perform.

One simple kind of output unit is based on an affine transformation with no nonlinearity. These are often just
called linear units.

Given features $h$, a layer of linear output units produces a vector $\hat{y} = W^Th + b$. Linear output layers are often 
used to produce the mean of a conditional Gaussian distribution:
$$
p(y|x) = \mathcal{N}(y; \hat{y}, I).
$$

Maximizing the log-likelihood is then equivalent to minimizing the mean squared error. 

Because linear units do not saturate, they pose little difficulty for gradient-based optimization algorithms and may be used
with a wide variety of optimization algorithms.
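
As a small illustration, the sketch below computes $\hat{y} = W^Th + b$ for made-up values of `W`, `b` and `h`, together with the gradient of the half squared error with respect to $\hat{y}$. The storage layout (one row of `W` per hidden feature) and the helper names are assumptions for this example. The gradient is simply the prediction error, so it shrinks only as the prediction approaches the target and never saturates.

```python
def linear_output(W, h, b):
    """Compute y_hat = W^T h + b, with W stored as one row per hidden feature."""
    n_out = len(b)
    return [sum(W[i][j] * h[i] for i in range(len(h))) + b[j] for j in range(n_out)]


def half_squared_error_grad(y_hat, y):
    """Gradient of 0.5 * ||y - y_hat||^2 with respect to y_hat."""
    return [yh - yi for yh, yi in zip(y_hat, y)]


# Made-up example: 3 hidden features mapped to 2 outputs.
W = [[1.0, 0.0],
     [0.0, 2.0],
     [1.0, 1.0]]
b = [0.5, -0.5]
h = [1.0, 2.0, 3.0]

y_hat = linear_output(W, h, b)
print(y_hat)                                       # [4.5, 6.5]
print(half_squared_error_grad(y_hat, [4.0, 7.0]))  # [0.5, -0.5]
```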

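Many tasks require predicting the value of a binary variable y, for example classification with two classes. The maximum likelihood
approach is to define a Bernoulli distribution over y conditioned on x.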

A Bernoulli distribution is defined by just a single number. The neural net needs to predict only
$P(y=1|x)$. For this number to be a valid probability, it must lie in the interval $[0,1]$.

Suppose we were to use a linear unit and threshold its value to obtain a valid probability:
$$
P(y=1|x) = \max \{ 0, \min \{ 1, w^Th+b \} \}
$$

This would indeed define a valid conditional distribution, but we would not be able to train it
very effectively with gradient descent. Any time that $w^Th + b$ strayed outside the unit interval,
the gradient of the output of the model with respect to its parameters would be 0. A gradient of 0
is typically problematic because the learning algorithm no longer has a guide for how to improve the 
corresponding parameters.
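
The vanishing gradient of the thresholded unit can be observed directly. The sketch below uses made-up weights and features and a central-difference estimate of the derivative with respect to the bias (a choice made for this illustration). Inside the unit interval the derivative is 1, while outside it is exactly 0.

```python
def clipped_probability(w, h, b):
    """P(y = 1 | x) = max(0, min(1, w^T h + b))."""
    z = sum(wi * hi for wi, hi in zip(w, h)) + b
    return max(0.0, min(1.0, z))


def numerical_grad_b(w, h, b, eps=1e-6):
    """Central-difference estimate of the derivative of the output w.r.t. b."""
    upper = clipped_probability(w, h, b + eps)
    lower = clipped_probability(w, h, b - eps)
    return (upper - lower) / (2 * eps)


w, h = [1.0, -1.0], [2.0, 0.5]  # made-up weights and features, w^T h = 1.5

inside = numerical_grad_b(w, h, b=-1.0)  # w^T h + b = 0.5, inside [0, 1]
outside = numerical_grad_b(w, h, b=1.0)  # w^T h + b = 2.5, clipped to 1

print(inside, outside)  # roughly 1.0, and exactly 0.0
```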

It is better to use a different approach that ensures there is always a strong gradient whenever the
model has the wrong answer. This approach is based on using sigmoid output units combined with 
maximum likelihood.

A sigmoid output unit is defined by
$$
\hat{y} = \sigma(w^Th+b)
$$

where $\sigma$ is the logistic sigmoid function. The sigmoid can be motivated by constructing an
unnormalized probability distribution $\tilde{P}(y)$, which does not sum to 1. We can then divide by an
appropriate constant to obtain a valid probability distribution. If we assume that the unnormalized
log probabilities are linear in $y$ and $z$, we can exponentiate to obtain the unnormalized probabilities:

$$
\log \tilde{P}(y) = yz
$$
$$
\tilde{P}(y) = \exp(yz)
$$
$$
P(y) = \frac{\exp(yz)}{\sum_{y'=0}^{1} \exp(y'z)}
$$
$$
P(y) = \sigma((2y-1)z)
$$
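
The last step of this derivation can be verified numerically. The sketch below, with helper names chosen for this example, normalizes $\exp(yz)$ over $y \in \{0, 1\}$ and checks that the result equals $\sigma((2y-1)z)$ for a few arbitrary logits.

```python
import math


def sigmoid(a):
    """Logistic sigmoid, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def normalized_probability(y, z):
    """P(y) = exp(y z) divided by the sum of exp(y' z) over y' in {0, 1}."""
    return math.exp(y * z) / (math.exp(0 * z) + math.exp(1 * z))


for z in (-3.0, 0.0, 0.7, 5.0):
    for y in (0, 1):
        # Normalizing exp(y z) gives the same value as sigmoid((2y - 1) z).
        assert abs(normalized_probability(y, z) - sigmoid((2 * y - 1) * z)) < 1e-12
    # The two probabilities form a valid Bernoulli distribution.
    total = normalized_probability(0, z) + normalized_probability(1, z)
    assert abs(total - 1.0) < 1e-12
```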

Probability distributions based on exponentiation and normalization are common throughout the statistical
modeling literature. The z variable defining such a distribution over binary variables is called a **logit**.

This approach to predicting the probabilities in log space is natural to use with maximum likelihood learning.
Because the cost function used with maximum likelihood is $-\log P(y|x)$, the log in the cost function undoes the exp
of the sigmoid. The loss function for maximum likelihood learning of a Bernoulli parametrized by a sigmoid is
$$
J(\theta) = -\log P(y|x)
$$
$$
= -\log \sigma((2y-1)z)
$$
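$$
= \zeta ((1-2y)z)
$$

where $\zeta(a) = \log(1 + \exp(a))$ is the softplus function. The last step uses the identity $-\log \sigma(a) = \zeta(-a)$.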
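
A minimal sketch of this loss, computed directly from the logit $z$, is shown below. It relies on the identity $-\log \sigma(a) = \zeta(-a)$, where $\zeta(a) = \log(1 + \exp(a))$ is the softplus function, so the loss equals the softplus of $(1-2y)z$. The function names are assumptions, and the gradient formula $\sigma(z) - y$ is a standard result that is not derived in the text. The example shows that a confident wrong prediction produces a large loss and a gradient of magnitude close to 1.

```python
import math


def softplus(a):
    """zeta(a) = log(1 + exp(a)), computed without overflow."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def bernoulli_nll(y, z):
    """-log P(y | x) for y in {0, 1} and logit z, i.e. -log sigmoid((2y - 1) z)."""
    return softplus((1 - 2 * y) * z)


def bernoulli_nll_grad(y, z):
    """Derivative of the loss with respect to z, which equals sigmoid(z) - y."""
    if z >= 0:
        sig = 1.0 / (1.0 + math.exp(-z))
    else:
        sig = math.exp(z) / (1.0 + math.exp(z))
    return sig - y


print(bernoulli_nll(1, 10.0))        # confident and correct: loss near 0
print(bernoulli_nll(1, -10.0))       # confident and wrong: loss near 10
print(bernoulli_nll_grad(1, -10.0))  # gradient near -1, so learning does not stall
```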

Any time we wish to represent a probability distribution over a discrete variable with $n$ possible values,
we may use the softmax function. This can be seen as a generalization of the sigmoid function, which was used
to represent a probability distribution over a binary variable.

Softmax functions are most often used as the output of a classifier, to represent the probability distribution over
$n$ different classes.

In the case of binary variables, we wished to produce a single number
$$
\hat{y} = P(y=1|x)
$$

Because this number needed to lie between 0 and 1, and because we wanted the logarithm of the number to be well
behaved for gradient-based optimization of the log-likelihood, we chose to instead predict a number 
$z = \log \tilde{P}(y=1|x)$. Exponentiating and normalizing gave us a Bernoulli distribution controlled by the 
sigmoid function.

To generalize to the case of a discrete variable with n values, we now need to produce a vector $\hat{y}$, with
$\hat{y}_i = P(y=i|x)$. We require not only that each element $\hat{y}_i$ lie between 0 and 1, but also that the entire vector sums to 1,
so that it represents a valid probability distribution. The same approach that worked for the Bernoulli distribution generalizes to the multinoulli
distribution:
$$
z = W^Th+b
$$
where $z_i = \log \tilde{P}(y=i|x)$. The softmax function can then exponentiate and normalize z to obtain the desired
$\hat{y}$. Formally, the softmax function is given by
$$
\text{softmax}(z)_i = \frac{\exp(z_i)}{\sum_{j} \exp(z_j)}
$$
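
A minimal sketch of the softmax is given below. Subtracting the largest logit before exponentiating is a common numerical safeguard added here as an assumption. It does not change the result, because the shift cancels in the ratio. The second part checks that with two classes and logits $[z, 0]$ the first output equals the logistic sigmoid of $z$.

```python
import math


def softmax(z):
    """Exponentiate and normalize z, shifting by max(z) for numerical stability."""
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, -1.0])  # arbitrary example logits
print(probs, sum(probs))           # entries lie in (0, 1) and sum to 1

# With two classes and logits [z, 0], the first entry reduces to sigmoid(z).
z = 0.8
two_class = softmax([z, 0.0])
print(two_class[0], 1.0 / (1.0 + math.exp(-z)))
```
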
As with the logistic sigmoid, the use of the exp function works well when training the softmax to output a target value
y using maximum log-likelihood. In this case, we wish to maximize $\log P(y=i; z) = \log \text{softmax}(z)_i$. Defining the
softmax in terms of exp is natural because the log in the log-likelihood can undo the exp of the softmax:
$$
\log \text{softmax}(z)_i = z_i - \log \sum_{j} \exp(z_j)
$$
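
This identity also suggests how to compute the log-probabilities in practice. The sketch below evaluates $z_i - \log \sum_j \exp(z_j)$ directly, using a shift by the largest logit inside the log-sum-exp term (a standard stabilization trick, added here as an assumption). The example logits are made up and deliberately large, so exponentiating them naively would overflow.

```python
import math


def log_softmax(z):
    """Return z_i - log(sum_j exp(z_j)) for each i, using a max shift for stability."""
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(zj - m) for zj in z))
    return [zi - log_sum_exp for zi in z]


z = [1000.0, 999.0, 0.0]  # math.exp(1000.0) alone would raise OverflowError
log_probs = log_softmax(z)
nll = -log_probs[0]       # cost when the correct class is i = 0

print(log_probs, nll)
```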