# Improving neural network learning
---

# The cross-entropy cost function
---
> We learn slowly when our errors are less well-defined.

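For a single neuron with training input $x = 1$ and desired output $y = 0$, the derivatives of the quadratic cost function $C = \frac{(y-a)^2}{2}$ are -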

$
\begin{eqnarray} 
  \frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z) \tag{1}\\
  \frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z),
\tag{2}\end{eqnarray}
$
**Note** - These follow from the chain rule: first differentiate $C$ w.r.t. $a$, then $a$ w.r.t. $z$, then $z$ w.r.t. $w$ or $b$.


**Origin of slow learning** - When the weighted input $z$ to the neuron is very large or very small (so the output is close to 1 or 0), the sigmoid curve gets very flat, so $\sigma'(z)$ gets very small and both partial derivatives shrink with it.
This can be solved by replacing the quadratic cost with a different cost function.
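
To make the flattening concrete, here is a minimal sketch that evaluates $\sigma'(z)$ at a few values of $z$. The function names `sigmoid` and `sigmoid_prime` and the sample values of $z$ are illustrative choices, not part of the notes above.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0 and decays quickly as |z| grows.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z = {z:5.1f}  sigma'(z) = {sigmoid_prime(z):.6f}")
```

The derivative is largest at $z = 0$, where it equals 0.25, and is already tiny by $z = 10$, which is why a saturated neuron trained with the quadratic cost updates its weights so slowly.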

The cross-entropy function is defined as 
$
\begin{eqnarray} 
  C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right],
\tag{A}\end{eqnarray}
$
where $n$ is the total number of training examples, the sum runs over all training inputs $x$, $y$ is the desired output for each input, and $a$ is the neuron's actual output.

Cross-entropy as cost function --

- Non-negative - Every term inside the sum is negative or zero, since $a$ and $1-a$ lie between 0 and 1 and so their logarithms are negative. The minus sign out front then makes $C \geq 0$.
- If the actual output is close to the desired output for all training inputs, $C \approx 0$.
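
Both properties can be checked numerically. The sketch below implements equation (A) for a single neuron, assuming each output $a$ lies strictly between 0 and 1 and each target $y$ is 0 or 1. The function name `cross_entropy_cost` and the sample values are made up for illustration.

```python
import math

def cross_entropy_cost(outputs, targets):
    """Average cross-entropy over (a, y) pairs, following equation (A).

    Assumes every output a lies strictly between 0 and 1.
    """
    n = len(outputs)
    total = 0.0
    for a, y in zip(outputs, targets):
        total += y * math.log(a) + (1.0 - y) * math.log(1.0 - a)
    return -total / n

# Outputs close to the targets give a small cost ...
good = cross_entropy_cost([0.99, 0.01], [1, 0])
# ... and outputs far from the targets give a large one.
bad = cross_entropy_cost([0.01, 0.99], [1, 0])
print(good, bad)
```

Both costs are non-negative, and the cost for the accurate outputs is much closer to zero than the cost for the inaccurate ones.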

$C$ tends towards zero as the neuron gets better at computing the desired output. Unlike the quadratic cost function, it also avoids the slow learning problem.
To see why, compute the partial derivative of the cross-entropy cost with respect to the weights, substituting $a = \sigma(z)$.
$
\begin{eqnarray}
  \frac{\partial C}{\partial w_j} & = & -\frac{1}{n} \sum_x \left(
    \frac{y }{\sigma(z)} -\frac{(1-y)}{1-\sigma(z)} \right)
  \frac{\partial \sigma}{\partial w_j} \tag{3}\\
 & = & -\frac{1}{n} \sum_x \left( 
    \frac{y}{\sigma(z)} 
    -\frac{(1-y)}{1-\sigma(z)} \right)\sigma'(z) x_j.
\tag{4}\end{eqnarray}
$
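Putting the bracket over a common denominator gives $(y-\sigma(z)) / [\sigma(z)(1-\sigma(z))]$, and since $\sigma'(z) = \sigma(z)(1-\sigma(z))$ the denominator cancels, leaving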

$
\begin{eqnarray} 
  \frac{\partial C}{\partial w_j} =  \frac{1}{n} \sum_x x_j(\sigma(z)-y).
\tag{5}\end{eqnarray}
$

It tells us that the rate at which the weight learns is controlled by $\sigma(z)-y$, i.e. the error in the output. The larger the error, the faster the neuron learns.
When we use the cross-entropy cost function, $\sigma'(z)$ gets canceled out, so slow learning is no longer a problem. This cancellation is the special miracle ensured by the cross-entropy cost function.
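
As a rough illustration of the difference, the sketch below compares the weight gradient under both costs for one training example on a single sigmoid neuron. The starting values `w = 2.0`, `b = 2.0`, `x = 1.0`, `y = 0.0` and the function names are assumptions chosen so that the neuron starts out saturated and badly wrong.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def quadratic_grad_w(w, b, x, y):
    """dC/dw for the quadratic cost: (a - y) * sigma'(z) * x."""
    a = sigmoid(w * x + b)
    return (a - y) * a * (1.0 - a) * x

def cross_entropy_grad_w(w, b, x, y):
    """dC/dw for the cross-entropy cost: (a - y) * x, with no sigma'(z) factor."""
    a = sigmoid(w * x + b)
    return (a - y) * x

# Hypothetical saturated neuron: the output is near 1 but the target is 0.
w, b, x, y = 2.0, 2.0, 1.0, 0.0
print("quadratic:    ", quadratic_grad_w(w, b, x, y))
print("cross-entropy:", cross_entropy_grad_w(w, b, x, y))
```

With these values the cross-entropy gradient is roughly the size of the output error itself, while the quadratic gradient is scaled down by the small factor $\sigma'(z)$.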

For a network with many output neurons, the cost function sums over all the neurons $j$ in the output layer
$
\begin{eqnarray}  C = -\frac{1}{n} \sum_x
  \sum_j \left[y_j \ln a^L_j + (1-y_j) \ln (1-a^L_j) \right].
\tag{B}\end{eqnarray}
$

- It is the generalization of the cross-entropy for probability distributions.

## Where does cross-entropy come from

To-Do later

# Softmax
---

# Overfitting and regularization
---

Models with a large number of free parameters can describe an amazingly wide range of phenomena. Even if such a model agrees well with the available data, that doesn't make it a good model. It may just mean there is enough freedom in the model that it can describe almost any dataset of the given size without capturing any genuine insight into the underlying phenomenon. Such a model then fails to generalize to new situations. So how can we trust the results of a neural network when it contains hundreds or thousands of parameters?

During training, the cost on the training data may keep decreasing after every epoch while the accuracy on the test data stops improving after a certain epoch. This indicates that the model has stopped generalizing beyond that point, and the cost improvement is an illusion. This is called `overfitting` or `overtraining`.

A training accuracy that is much higher than the test accuracy also hints at overfitting.
The obvious way to detect overfitting is to keep track of the accuracy on the test data as the network trains. If the accuracy on the test data is no longer improving, we should stop training. In practice we use a separate `validation set` to check the model for signs of overfitting first.

**Early Stopping** - We compute the classification accuracy on the `validation data` at the end of each epoch. Once it has saturated, we stop training.
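
One way to turn "saturated" into a concrete rule is to stop once the validation accuracy has not reached a new best for a fixed number of epochs. The sketch below assumes that rule. The function name `early_stopping_epoch`, the `patience` parameter and the accuracy values are all hypothetical and only illustrate the idea.

```python
def early_stopping_epoch(val_accuracies, patience=3):
    """Return the index of the epoch at which training would stop.

    val_accuracies: validation accuracy recorded after each epoch.
    patience: assumed setting, the number of consecutive epochs without
              a new best accuracy that we tolerate before stopping.
    """
    best = float("-inf")
    epochs_without_gain = 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best = acc
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                return epoch
    # Accuracy never saturated: train through the last recorded epoch.
    return len(val_accuracies) - 1

# Made-up accuracy curve that stops improving after the fourth epoch.
history = [0.80, 0.88, 0.91, 0.93, 0.93, 0.92, 0.93, 0.93]
print(early_stopping_epoch(history, patience=3))
```

A larger `patience` makes the rule less likely to stop on a temporary plateau, at the cost of training for more epochs.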

Why use the `validation data` rather than `test data` to prevent overfitting?

It is part of a more general strategy: the `validation data` is used to evaluate different trial choices of hyper-parameters such as the number of epochs, the learning rate and so on.
There are many possible choices of hyper-parameters. If we set them based on evaluations on the `test data`, we may end up overfitting the hyper-parameters to the `test data`. We guard against that by choosing the hyper-parameters using the `validation data`. The final evaluation on the `test data` then gives a true measure of generalization.
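
A minimal sketch of this hold-out approach splits the available examples into three disjoint sets. The function name `split_dataset`, the split sizes and the fixed seed are illustrative assumptions.

```python
import random

def split_dataset(examples, n_validation, n_test, seed=0):
    """Shuffle the examples and split them into three disjoint lists:
    (training, validation, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    training = shuffled[n_test + n_validation:]
    return training, validation, test

# Hypothetical dataset of 100 examples, identified by index.
training, validation, test = split_dataset(range(100), n_validation=15, n_test=15)
print(len(training), len(validation), len(test))
```

Hyper-parameters are then compared using only `validation`, and `test` is touched once at the end to report how well the chosen model generalizes.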

- Also, adding more `training data` can help to reduce overfitting, but collecting it is often expensive or impractical.

# Regularization
---

