# Multi Layer Perceptrons | Lecture 1

A perceptron (threshold unit) can learn anything that it can represent (i.e. any classification that is separable with a hyperplane).
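As a small illustration of this claim, the sketch below trains a threshold unit on logical AND, which is separable with a hyperplane. The function names, the learning rate, the epoch cap, and the constant bias input are illustrative assumptions, not something the lecture specifies.

```python
import numpy as np

def add_bias(X):
    # Append a constant 1 to every input so the threshold is learned as a weight (assumption).
    return np.hstack([X, np.ones((len(X), 1))])

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron learning rule: update the weights only on misclassified examples."""
    Xb = add_bias(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(Xb, y):
            prediction = 1 if w @ x > 0 else 0
            if prediction != target:
                w += eta * (target - prediction) * x
                errors += 1
        if errors == 0:  # every training example is classified correctly
            break
    return w

def predict(w, X):
    return (add_bias(X) @ w > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])

w = train_perceptron(X, y_and)
print(predict(w, X))
```

The same loop would never reach zero errors on XOR, since no single hyperplane separates its classes.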

How could we use this?

# Hidden Units & Weight Updates | Lecture 2

## Learning with Hidden Units

* Networks without hidden units are very limited in the input-output mappings they can model
    * more layers of linear units do not help: the composition is still linear
    * fixed output non-linearities are not enough

* We need multiple layers of adaptive non-linear hidden units. This gives us a universal approximator. But how can we train such nets?
    * We need an efficient way of adapting all the weights, not just the last layer. This is hard. Learning the weights going into hidden units is equivalent to learning features.
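The point that extra layers of linear units do not help can be checked numerically. The sketch below, with arbitrary illustrative shapes and random weights, shows that two stacked linear layers compute exactly the same mapping as a single layer whose weight matrix is their product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers without any non-linearity (shapes are arbitrary).
W1 = rng.normal(size=(4, 3))   # input (3) -> hidden (4)
W2 = rng.normal(size=(2, 4))   # hidden (4) -> output (2)
x = rng.normal(size=3)

two_layer_output = W2 @ (W1 @ x)

# The same mapping expressed as one linear layer.
W_combined = W2 @ W1
one_layer_output = W_combined @ x

print(np.allclose(two_layer_output, one_layer_output))
```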

## Learning by adjusting weights

* Randomly adjust one weight and see if it improves performance. If so, save the change
    * very inefficient: we need multiple forward passes on a representative set of training data just to change one weight
    * towards the end of learning, large weight adjustments will nearly always make things worse.

* We could randomly adjust all the weights in parallel and correlate the performance gain with the weight changes
    * Not any better because we need lots of trials to "see" the effect of changing one weight through the noise created by all the others.
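A minimal sketch of the first strategy (perturb one weight, keep the change only if performance improves). To keep it short it uses a toy linear model and a mean squared error loss, both of which are assumptions for illustration; the perturbation scale and step count are also arbitrary.

```python
import numpy as np

def perturb_one_weight(w, loss, rng, scale=0.1):
    """Randomly adjust one weight; keep the change only if the loss improves."""
    i = rng.integers(len(w))
    candidate = w.copy()
    candidate[i] += rng.normal(scale=scale)
    # Every call to loss() stands for forward passes over the whole training set.
    return candidate if loss(candidate) < loss(w) else w

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])   # toy targets from a known linear rule

def loss(w):
    return float(np.mean((X @ w - y) ** 2))

w = np.zeros(3)
loss_before = loss(w)
for _ in range(500):
    w = perturb_one_weight(w, loss, rng)
loss_after = loss(w)

print(loss_before, loss_after)
```

Each accepted or rejected change costs two full evaluations of the training set, which is the inefficiency the notes point out. With only three weights this is tolerable; with thousands of weights it is not.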

Learning the hidden-to-output weights is easy. Learning the input-to-hidden weights is hard.

## Learning Algorithms for MLP

* similar to the perceptron learning algorithm
    * one minor difference is that we may have several outputs, so we have an output vector $h_w(x)$ rather than a single value, and each example has an output vector $y$.
    * the major difference is that, whereas the error $y - h_w(x)$ at the perceptron output layer is clear, the error at the hidden layers seems mysterious because the training data does not say what value the hidden nodes should have

We can back-propagate the error from the output layer to the hidden layers. The back-propagation process emerges directly from a derivation of the overall error gradient.

The weight updates are then computed from this gradient.

## The idea behind backpropagation

* we don't know what the hidden units ought to do, but we can compute how fast the error changes as we change a hidden activity
    * instead of using desired activities to train the hidden units, use error derivatives w.r.t. hidden activities
    * each hidden activity can affect many output units and can therefore have many separate effects on the error. These effects must be combined
    * we can compute error derivatives for all the hidden units efficiently
    * once we have the error derivatives for the hidden activities, it's easy to get the error derivatives for the weights going into a hidden unit.
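A sketch of the chain-rule steps behind these bullets, under assumptions that the notes do not state at this point: squared error $E = \frac{1}{2}\sum_k (t_k - y_k)^2$, sigmoid units with output $y_j = \sigma(z_j)$, and total input $z_j = \sum_i w_{i,j} x_{i,j}$, where $x_{i,j}$ is the input from unit $i$ to unit $j$.

For an output unit $k$, the error derivative w.r.t. its total input is:

$$\frac{\partial E}{\partial z_k} = -(t_k - y_k) \text{ } y_k (1 - y_k)$$

A hidden activity $y_h$ affects the error through every output unit it feeds, so its separate effects are summed:

$$\frac{\partial E}{\partial y_h} = \sum_k w_{h,k} \frac{\partial E}{\partial z_k}$$

Passing back through the hidden unit's own sigmoid gives:

$$\frac{\partial E}{\partial z_h} = y_h (1 - y_h) \frac{\partial E}{\partial y_h}$$

Finally, the derivative for any weight going into unit $j$ is:

$$\frac{\partial E}{\partial w_{i,j}} = x_{i,j} \frac{\partial E}{\partial z_j}$$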

## Backpropagation

* gradient descent over entire network weight vector
* easily generalized to arbitrary directed graphs
* will find a local, not necessarily global, error minimum; in practice it often works well (can be invoked multiple times with different initial weights)
* often includes a weight momentum term:

$$\Delta w_{i, j}(n) = \eta \text{ } \delta_j \text{ } x_{i,j} + \alpha \text{ } \Delta w_{i,j}(n-1)$$

* minimizes error over the training examples
* training can be slow: typically 1000 to 10000 iterations
* using network after training is fast
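The momentum term can be sketched as follows, reading the update as: the new weight change equals the plain gradient step $\eta \delta x$ plus $\alpha$ times the previous weight change. The function name and the numbers are illustrative assumptions.

```python
def momentum_step(w, eta_delta_x, prev_delta_w, alpha=0.9):
    """One weight update with momentum.

    eta_delta_x  : the plain gradient step, eta * delta_j * x_ij
    prev_delta_w : the weight change applied at the previous iteration
    """
    delta_w = eta_delta_x + alpha * prev_delta_w
    return w + delta_w, delta_w

# With a constant gradient step, successive weight changes grow towards a limit.
w, delta_w = 0.0, 0.0
history = []
for n in range(3):
    w, delta_w = momentum_step(w, eta_delta_x=0.1, prev_delta_w=delta_w, alpha=0.5)
    history.append(delta_w)

print(history)
```

When consecutive gradient steps point the same way, the changes accumulate and the weight moves faster; when they alternate in sign, the accumulated term partly cancels them.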

## Convergence of Backprop

Gradient descent converges to some local minimum, which is perhaps not the global minimum. Possible remedies:

* add a momentum term: $\Delta w_{i, j}(n) = \eta \text{ } \delta_j \text{ } x_{i,j} + \alpha \text{ } \Delta w_{i,j}(n-1)$ with $\alpha \in [0, 1]$

* stochastic gradient descent
* train multiple nets with different initial weights

Nature of convergence

* initialize weights near zero
* therefore, initial networks are near-linear
* increasingly non-linear functions possible as training progresses
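The near-linear behaviour with small weights can be checked numerically. The sketch assumes sigmoid units and compares the sigmoid with its tangent line at zero, $0.5 + z/4$, for a small and a large weight; the weight values and the input range are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-1.0, 1.0, 101)   # inputs in a fixed range

def max_deviation_from_linear(weight):
    """Largest gap between sigmoid(weight * x) and its linear approximation at 0."""
    z = weight * x
    return float(np.max(np.abs(sigmoid(z) - (0.5 + z / 4.0))))

small = max_deviation_from_linear(0.01)   # weight near zero: unit is almost linear
large = max_deviation_from_linear(5.0)    # large weight: unit is clearly non-linear

print(small, large)
```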

## Backpropagation algorithm

* initialize each $w_i$ to some small random value
* until the termination condition is met, do
    - for each training example $\langle (x_1, \ldots, x_n), t \rangle$ do
        - input the instance $(x_1, \ldots, x_n)$ to the network and compute the network outputs $y_k$
        - for each output unit $k$
            - $\delta_k = y_k(1-y_k)(t_k - y_k)$
        - for each hidden unit $h$
            - $\delta_h = y_h(1-y_h)\sum_k w_{h,k} \delta_k$
        - for each network weight $w_{i,j}$ do
            - $w_{i,j} = w_{i,j} + \Delta w_{i,j}$ where $\Delta w_{i,j} = \eta \delta_j x_{i,j}$ and $x_{i,j}$ is the input from unit $i$ to unit $j$
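The steps above can be sketched in NumPy for a network with one hidden layer of sigmoid units. Several details are assumptions made for illustration: the constant bias input, the layer sizes, the learning rate, the number of epochs, the uniform initial weight range, and the use of XOR as the training set. Whether a given run reaches a low error depends on the initial weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hidden, W_out):
    """Return the biased input, the biased hidden activities, and the outputs y_k."""
    x_b = np.append(x, 1.0)                 # constant 1 acts as a bias input (assumption)
    h_b = np.append(sigmoid(W_hidden @ x_b), 1.0)
    y_k = sigmoid(W_out @ h_b)
    return x_b, h_b, y_k

def backprop_step(x, t, W_hidden, W_out, eta=0.5):
    """One online update for a single example (x, t); modifies the weights in place."""
    x_b, h_b, y_k = forward(x, W_hidden, W_out)
    y_h = h_b[:-1]
    # Output units: delta_k = y_k (1 - y_k) (t_k - y_k)
    delta_k = y_k * (1.0 - y_k) * (t - y_k)
    # Hidden units: delta_h = y_h (1 - y_h) sum_k w_hk delta_k (bias column excluded)
    delta_h = y_h * (1.0 - y_h) * (W_out[:, :-1].T @ delta_k)
    # Every weight: w_ij += eta * delta_j * x_ij
    W_out += eta * np.outer(delta_k, h_b)
    W_hidden += eta * np.outer(delta_h, x_b)

def total_error(X, T, W_hidden, W_out):
    """Sum over examples of half the squared output error."""
    return sum(0.5 * np.sum((t - forward(x, W_hidden, W_out)[2]) ** 2)
               for x, t in zip(X, T))

rng = np.random.default_rng(0)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([[0.0], [1.0], [1.0], [0.0]])          # XOR targets

# Small random initial weights; the last column of each matrix is the bias weight.
W_hidden = rng.uniform(-0.5, 0.5, size=(3, 3))      # 2 inputs + bias -> 3 hidden units
W_out = rng.uniform(-0.5, 0.5, size=(1, 4))         # 3 hidden + bias -> 1 output

error_before = total_error(X, T, W_hidden, W_out)
for epoch in range(5000):                           # fixed epoch count as termination condition
    for x, t in zip(X, T):
        backprop_step(x, t, W_hidden, W_out)
error_after = total_error(X, T, W_hidden, W_out)

print(error_before, error_after)
```

The hidden deltas are computed before the output weights are changed, so both layers are updated from the same forward pass.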

# Heuristics & Expressivity | Lecture 3

Taking a look at how we can speed up the convergence of a multi-layer perceptron.

## Heuristics to speed convergence

* to speed the convergence of the back-propagation algorithm the following heuristics are applied:
    * H1: use sequential (online) rather than batch updates
    * H2: maximize information content
        * use examples that produce the largest error
        * use examples which are very different from all the previous ones
    * H3: use an antisymmetric activation function, such as the hyperbolic tangent. Antisymmetric means: $\phi (-x) = - \phi (x)$
    * H4: use target values inside a smaller range, away from the asymptotic values of the sigmoid (for example 0.1 and 0.9 rather than 0 and 1)
    * H5: normalize the inputs:
        * create zero-mean variables
        * decorrelate the variables
        * scale the variables to have covariances approximately equal
    * H6: initialize the weights properly. Use a zero-mean distribution with standard deviation $\sigma_w = \frac{1}{\sqrt{m}}$ (i.e. variance $\frac{1}{m}$)
        * where $m$ is the number of connections arriving at a neuron
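A minimal sketch of the input normalization and weight initialization heuristics. It assumes the initialization spread is $1/\sqrt{m}$ taken as a standard deviation, uses an eigen-decomposition of the covariance matrix to decorrelate the inputs, and draws weights from a normal distribution; the function names and the toy data are illustrative.

```python
import numpy as np

def normalize_inputs(X, eps=1e-12):
    """Zero-mean, decorrelate, and rescale the columns of X to unit variance."""
    X_centered = X - X.mean(axis=0)                    # create zero-mean variables
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)             # covariance is symmetric
    decorrelated = X_centered @ eigvecs                # rotate onto principal axes
    return decorrelated / np.sqrt(eigvals + eps)       # equalize the variances

def init_weights(n_units, m, rng):
    """Zero-mean weights with standard deviation 1/sqrt(m), m = incoming connections."""
    return rng.normal(loc=0.0, scale=1.0 / np.sqrt(m), size=(n_units, m))

rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X_raw = rng.normal(size=(500, 3)) @ mixing + np.array([5.0, -3.0, 1.0])

Z = normalize_inputs(X_raw)
W = init_weights(n_units=200, m=100, rng=rng)

print(np.round(np.cov(Z, rowvar=False), 3))
print(W.std())
```

After normalization the covariance matrix of the inputs is close to the identity, so no input dominates the weight updates purely because of its scale.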

## Adjusting Learning Rates

* R1: every adjustable parameter should have its own learning rate
* R2: every learning rate should be allowed to adjust from one iteration to the next
* R3: when the derivative of the cost function w.r.t. a weight has the same algebraic sign for several consecutive iterations of the algorithm, the learning rate for that particular weight should be increased
* R4: when the algebraic sign of the derivative above alternates for several consecutive iterations of the algorithm, the learning rate for that weight should be decreased
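A simplified sketch of rules R1 to R4. It keeps one learning rate per weight and compares the sign of the current derivative only with the previous one, rather than over several consecutive iterations; that simplification, the function name, and the increase and decrease factors are all assumptions for illustration.

```python
import numpy as np

def adapt_learning_rates(rates, grad, prev_grad, up=1.1, down=0.5):
    """Return updated per-weight learning rates based on derivative sign agreement."""
    agreement = grad * prev_grad
    new_rates = rates.copy()
    new_rates[agreement > 0] *= up      # same sign as before: increase the rate (R3)
    new_rates[agreement < 0] *= down    # sign flipped: decrease the rate (R4)
    return new_rates                    # zero derivative: rate left unchanged

rates = np.array([0.1, 0.1, 0.1])       # one rate per weight (R1)
prev_grad = np.array([2.0, 3.0, 1.0])
grad = np.array([1.0, -1.0, 0.0])

new_rates = adapt_learning_rates(rates, grad, prev_grad)   # rates change per iteration (R2)
print(new_rates)
```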

## General Comments

* empirical knowledge shows that the number of data pairs needed to achieve a given error level $\epsilon$ is:

$$ N = O\left(\frac{W}{\epsilon}\right)$$

* where W is the total number of adjustable parameters of the model. There is mathematical support for this observation (but we will not analyse this further)
* there is the curse of dimensionality for approximating functions in high-dimensional spaces
* it is theoretically justified to use two hidden layers
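To get a feel for the rule of thumb that the number of training pairs grows in proportion to $W / \epsilon$, the sketch below counts the adjustable parameters of a fully connected network and divides by the target error level. Counting one bias per non-input unit is an assumption, and the result is only an order of magnitude because the constant factor is unknown.

```python
def count_parameters(layer_sizes):
    """Weights plus one bias per non-input unit of a fully connected network."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

def rough_sample_size(layer_sizes, epsilon):
    """W / epsilon, ignoring the unknown constant factor."""
    return count_parameters(layer_sizes) / epsilon

layers = [10, 5, 1]                 # 10 inputs, 5 hidden units, 1 output
W = count_parameters(layers)        # (10 + 1) * 5 + (5 + 1) * 1
N = rough_sample_size(layers, epsilon=0.1)
print(W, N)
```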

## Expressiveness of multi-layer feedforward networks

Boolean functions:
* every boolean function can be represented by a network with a single hidden layer
* but might require exponential (in number of inputs) hidden units
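As a concrete case, XOR is not representable by a single threshold unit but is representable with one hidden layer. The sketch below uses hand-chosen weights (one of many possible choices) for two hidden threshold units computing OR and NAND, combined by an output unit computing AND.

```python
def step(z):
    """Threshold unit: fires when its total input is positive."""
    return 1 if z > 0 else 0

def xor_network(x1, x2):
    h_or = step(x1 + x2 - 0.5)          # hidden unit 1: OR
    h_nand = step(-x1 - x2 + 1.5)       # hidden unit 2: NAND
    return step(h_or + h_nand - 1.5)    # output unit: AND of the two hidden units

truth_table = [xor_network(a, b) for a in (0, 1) for b in (0, 1)]
print(truth_table)
```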

Continuous functions:
* every bounded continuous function can be approximated with arbitrarily small error by a network with a single hidden layer

Any function can be approximated to arbitrary accuracy by a network with two hidden layers

## How long should you train the net?

* the goal is to achieve a balance between correct responses for the training patterns and correct responses for new patterns. (that is, a balance between memorization and generalization)

* if you train the net for too long, then you run the risk of overfitting

* select the number of training iterations via cross-validation on a holdout set
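One common way to pick the number of iterations is to monitor the error on held-out data and keep the iteration where it was lowest. The sketch below is a simple single-holdout version with a patience limit, not full cross-validation; the function names, the patience value, and the synthetic error curve that stands in for a real network are all assumptions.

```python
def train_with_early_stopping(train_step, holdout_error, max_iters=1000, patience=10):
    """Run train_step repeatedly; return the iteration with the lowest holdout error."""
    best_error, best_iter = float("inf"), 0
    for it in range(1, max_iters + 1):
        train_step()
        error = holdout_error()
        if error < best_error:
            best_error, best_iter = error, it
        elif it - best_iter >= patience:
            break                       # no improvement for `patience` iterations
    return best_iter, best_error

# Toy stand-in for a real network: holdout error falls, then rises (overfitting).
state = {"iters": 0}

def fake_train_step():
    state["iters"] += 1

def fake_holdout_error():
    return 0.1 + (state["iters"] - 30) ** 2 / 900.0

best_iter, best_error = train_with_early_stopping(fake_train_step, fake_holdout_error)
print(best_iter, best_error)
```

In a real setting the weights from the best iteration would also be saved, so that the network can be restored to that point after training stops.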

# Dataflow programming & TensorFlow | Lecture 4