# Chapter 38 - Introduction to Neural Networks (pg 480)

<b>Supervised neural networks</b> are given data in the form of inputs and targets,
the targets being a teacher's specification of what the neural network's
response to the input should be.

<b>Unsupervised neural networks</b> are given data in an undivided form: simply
a set of examples $\{x\}$.

![Image](Images/Ch38_1.PNG)

# Chapter 39 - The Single Neuron as a Classifier (pg 483)

![Image](Images/Ch39_1.PNG)
![Image](Images/Ch39_2.PNG)


The central idea of supervised neural networks is this: given examples of an input vector $x$ and a target $t$, learn a model of the relationship between $x$ and $t$. A
successfully trained network will, for any given $x$, give an output $y$ that is
close (in some sense) to the target value $t$. Training the network involves
searching in the weight space of the network for a value of $w$ that produces a
function that fits the provided training data well.

Typically an <b>objective function</b> or <b>error function</b> is defined, as a function of $w$, to measure how well the network with weights set to $w$ solves the task. The objective function is a sum of terms, one for each input/target pair $\{x, t\}$, measuring how close the output $y(x; w)$ is to the target $t$.

The training process is an exercise in function minimization, i.e., adjusting $w$ so as to find a value that minimizes the objective function. For general feedforward neural networks, the backpropagation algorithm efficiently evaluates the gradient of the output $y$ with respect to the parameters $w$, and thence the gradient of the objective function with respect to $w$.

For a single neuron with binary targets $t^{(n)} \in \{0, 1\}$ and an output $y$ between 0 and 1, we can then write down the following error function:
$$G(w) = - \sum_{n} \left[t^{(n)} \ln y\left(x^{(n)};w\right) + (1 - t^{(n)}) \ln\left(1 - y\left(x^{(n)};w\right)\right) \right]$$
The objective function is bounded below by zero and only attains this value if $y(x^{(n)};w) = t^{(n)}$ for all $n$.
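
As a minimal sketch of how $G(w)$ might be evaluated, the snippet below assumes the single neuron has a logistic output $y(x; w) = 1/(1 + e^{-w \cdot x})$ and that a constant first input component plays the role of the bias. The function names and the two-example data set are illustrative assumptions, not part of the notes.

```python
import math

def neuron_output(x, w):
    """Assumed logistic output y(x; w) = 1 / (1 + exp(-w . x))."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-a))

def objective(w, inputs, targets):
    """Error function G(w): one term per input/target pair."""
    total = 0.0
    for x, t in zip(inputs, targets):
        y = neuron_output(x, w)
        total -= t * math.log(y) + (1 - t) * math.log(1 - y)
    return total

# Hypothetical toy data; the first component of each input is a constant bias input.
inputs = [(1.0, 2.0), (1.0, -1.0)]
targets = [1, 0]

g_zero = objective([0.0, 0.0], inputs, targets)    # every output is 0.5
g_better = objective([0.0, 2.0], inputs, targets)  # outputs move towards the targets
```

With all weights at zero each output is 0.5, so each term contributes $\ln 2$; weights that push the outputs towards their targets give a smaller, but still positive, value of $G$.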


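The <b>backpropagation</b> algorithm evaluates the gradient of $G$ with respect to the weights. For the single neuron, writing $y^{(n)}$ for $y(x^{(n)};w)$, it is:
$$g_{j} = \frac{\partial G}{\partial w_{j}} = \sum_{n=1}^{N} -(t^{(n)} - y^{(n)})x^{(n)}_{j}$$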
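
One way to gain confidence in the gradient expression above is to compare it with a finite-difference estimate of the slope of $G$. The sketch below does this under the assumption of a logistic neuron, $y = 1/(1 + e^{-w \cdot x})$; the helper names, data, and weights are hypothetical.

```python
import math

def neuron_output(x, w):
    """Assumed logistic output of the single neuron."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-a))

def objective(w, inputs, targets):
    """Error function G(w) summed over the data."""
    total = 0.0
    for x, t in zip(inputs, targets):
        y = neuron_output(x, w)
        total -= t * math.log(y) + (1 - t) * math.log(1 - y)
    return total

def gradient(w, inputs, targets):
    """Analytic gradient: g_j = sum_n -(t_n - y_n) * x_nj."""
    g = [0.0] * len(w)
    for x, t in zip(inputs, targets):
        e = t - neuron_output(x, w)
        for j, xj in enumerate(x):
            g[j] -= e * xj
    return g

def finite_difference_gradient(w, inputs, targets, h=1e-6):
    """Central-difference estimate of dG/dw_j, used only as a check."""
    g = []
    for j in range(len(w)):
        up, down = list(w), list(w)
        up[j] += h
        down[j] -= h
        g.append((objective(up, inputs, targets)
                  - objective(down, inputs, targets)) / (2 * h))
    return g

# Hypothetical data and weights; the first input component is a bias input.
inputs = [(1.0, 2.0), (1.0, -1.0), (1.0, 0.5)]
targets = [1, 0, 1]
w = [0.3, -0.2]
analytic = gradient(w, inputs, targets)
numeric = finite_difference_gradient(w, inputs, targets)
```

The two lists should agree to within the accuracy of the finite-difference approximation.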

The simplest thing to do with a gradient of an error function is to descend it.

<b>On-line learning algorithm</b>

The teacher supplies a target value $t \in \{0, 1\}$ which says what the correct answer is for the given input. We compute the error signal:
$$e = t - y$$
then adjust the weights $w$ in a direction that would reduce the magnitude
of this error:
$$\Delta w_{i} = \eta ex_{i}$$
where $\eta$ is the learning rate. Commonly $\eta$ is set by trial and error to a constant value or to a decreasing function of simulation time $\tau$ such as
$\eta_{0}/\tau$.
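
The on-line rule can be sketched as a loop that visits one example at a time and immediately applies $\Delta w_i = \eta e x_i$. The code below assumes a logistic neuron, a constant learning rate, a random presentation order, and a small hypothetical data set; none of these choices comes from the notes.

```python
import math
import random

def neuron_output(x, w):
    """Assumed logistic output y = 1 / (1 + exp(-w . x))."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-a))

def online_train(inputs, targets, eta=0.1, epochs=200, seed=0):
    """Update the weights after every single example."""
    rng = random.Random(seed)
    w = [0.0] * len(inputs[0])
    order = list(range(len(inputs)))
    for _ in range(epochs):
        rng.shuffle(order)                 # present the examples in random order
        for n in order:
            x, t = inputs[n], targets[n]
            e = t - neuron_output(x, w)    # error signal e = t - y
            w = [wi + eta * e * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical, linearly separable data; the first component is a bias input.
inputs = [(1.0, 2.0, 1.0), (1.0, 1.5, 2.0), (1.0, -1.0, -1.5), (1.0, -2.0, -0.5)]
targets = [1, 1, 0, 0]

w = online_train(inputs, targets)
predictions = [1 if neuron_output(x, w) > 0.5 else 0 for x in inputs]
```

Replacing the constant `eta` with a schedule such as $\eta_0/\tau$ would only require tracking the number of updates made so far.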

An alternative paradigm is to go through a batch of examples, computing the outputs and errors and accumulating the weight changes, which are then applied at the end of the batch (<b>batch learning</b>).

This batch learning algorithm is a gradient descent algorithm, whereas the
on-line algorithm is a stochastic gradient descent algorithm.
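
For comparison with the on-line rule, here is a sketch of batch learning in which the changes $\eta e x_i$ are accumulated over the whole data set and applied once per pass. It assumes a logistic neuron and a fixed learning rate; the data set and function names are hypothetical.

```python
import math

def neuron_output(x, w):
    """Assumed logistic output y = 1 / (1 + exp(-w . x))."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-a))

def batch_train(inputs, targets, eta=0.1, steps=500):
    """Accumulate the changes over the batch, then apply them together."""
    w = [0.0] * len(inputs[0])
    for _ in range(steps):
        delta = [0.0] * len(w)
        for x, t in zip(inputs, targets):
            e = t - neuron_output(x, w)    # error for this example
            for i, xi in enumerate(x):
                delta[i] += eta * e * xi   # accumulate, do not apply yet
        w = [wi + di for wi, di in zip(w, delta)]  # one step per batch
    return w

# Hypothetical, linearly separable data; the first component is a bias input.
inputs = [(1.0, 2.0, 1.0), (1.0, 1.5, 2.0), (1.0, -1.0, -1.5), (1.0, -2.0, -0.5)]
targets = [1, 1, 0, 0]

w = batch_train(inputs, targets)
predictions = [1 if neuron_output(x, w) > 0.5 else 0 for x in inputs]
```

Each pass moves $w$ along the negative gradient of $G$ computed from all examples, which is why this variant is plain gradient descent rather than its stochastic counterpart.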

<b>Overfitting</b>

An ad hoc solution to overfitting is to use early stopping, that is, use
an algorithm originally intended to minimize the error function $G(w)$, then
prevent it from doing so by halting the algorithm at some point.
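
The notes do not say how the stopping point is chosen. One common convention, assumed here, is to monitor the error on a held-out validation set and halt once it has stopped improving for a fixed number of steps. The validation split, the `patience` parameter, the logistic neuron, and the data are all assumptions made for this sketch.

```python
import math

def neuron_output(x, w):
    """Assumed logistic output y = 1 / (1 + exp(-w . x))."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-a))

def objective(w, inputs, targets):
    """Error function G(w); outputs are clipped to keep the logs finite."""
    total = 0.0
    for x, t in zip(inputs, targets):
        y = min(max(neuron_output(x, w), 1e-12), 1.0 - 1e-12)
        total -= t * math.log(y) + (1 - t) * math.log(1 - y)
    return total

def train_with_early_stopping(train, valid, eta=0.1, max_steps=1000, patience=10):
    """Batch gradient descent on the training set, halted by validation error."""
    train_x, train_t = train
    valid_x, valid_t = valid
    w = [0.0] * len(train_x[0])
    best_w, best_err, best_step = list(w), objective(w, valid_x, valid_t), 0
    for step in range(1, max_steps + 1):
        g = [0.0] * len(w)
        for x, t in zip(train_x, train_t):
            e = t - neuron_output(x, w)
            for j, xj in enumerate(x):
                g[j] -= e * xj
        w = [wi - eta * gj for wi, gj in zip(w, g)]
        err = objective(w, valid_x, valid_t)
        if err < best_err:
            best_w, best_err, best_step = list(w), err, step
        elif step - best_step >= patience:
            break                          # no recent improvement: stop early
    return best_w, best_err, best_step

# Hypothetical data with one noisy training label; first component is a bias input.
train = ([(1.0, -2.0), (1.0, -1.0), (1.0, 1.0), (1.0, 2.0), (1.0, -1.5)],
         [0, 0, 1, 1, 1])
valid = ([(1.0, -1.5), (1.0, 1.5), (1.0, -0.5), (1.0, 0.5)],
         [0, 1, 0, 1])

initial_err = objective([0.0, 0.0], valid[0], valid[1])
best_w, best_err, best_step = train_with_early_stopping(train, valid)
```

The function returns the weights with the lowest validation error seen so far, not the weights at the final step.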

A more principled solution to overfitting makes use of regularization. Regularization
involves modifying the objective function in such a way as to incorporate
a bias against the sorts of solution $w$ which we dislike.

We modify the objective function to:
$$M(w) = G(w) + \alpha E_{W}(w)$$
where the simplest choice of regularizer is the weight decay regularizer
$$E_{W}(w) =\frac{1}{2}\sum_{i}w^{2}_{i}$$
The regularization constant $\alpha$ is called the weight decay rate. This additional
term favours small values of $w$ and decreases the tendency of a model to overfit
fine details of the training data. The quantity $\alpha$ is known as a hyperparameter.

Gradient descent with a step size $\eta$ is in general not the most efficient way to
minimize a function. Most neural network experts use more advanced optimizers such as conjugate gradient algorithms.

# Chapter 40 - Capacity of a Single Neuron (pg 495)


# Chapter 41 - Learning as Inference (pg 504)
# Chapter 42 - Hopfield Networks (pg 517)
# Chapter 43 - Boltzmann Machines (pg 534)
# Chapter 44 - Supervised Learning in Multilayer Networks (pg 539)
# Chapter 45 - Gaussian Processes (pg 547)
# Chapter 46 - Deconvolution (pg 561)

# Appendix - (pg 610)