# The Perceptron

The *Perceptron* is one of the simplest artificial neural network architectures, proposed in 1957 by Frank Rosenblatt. It is based on a *threshold logic unit (TLU)*, which computes a weighted sum of its inputs

$$ z = w_1x_1 + \cdots + w_nx_n = \textbf{x}^{\intercal}\textbf{w} $$

then applies a step function to that sum and outputs the result: $h_{\textbf{w}}(\textbf{x})=\text{step}(z)$. One of the most commonly used step functions is the *Heaviside step function*

$$ \text{heaviside}(z) = \begin{cases} 0 & \text{if } z<0 \\ 1 & \text{if } z\geq0 \end{cases}$$

A single TLU can be used for binary classification: it computes a linear combination of its inputs and, if the result reaches the threshold, outputs the positive class; otherwise it outputs the negative class.
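As a minimal sketch of a single TLU, the snippet below computes the weighted sum and applies a Heaviside step. The function names, the explicit bias term `b`, the convention $\text{heaviside}(0) = 1$, and the weights (picked by hand so the unit acts like a logical AND) are all assumptions made for illustration.

```python
import numpy as np

def heaviside(z):
    """Heaviside step: 0 where z < 0, 1 where z >= 0 (convention assumed here)."""
    return np.where(np.asarray(z) >= 0, 1, 0)

def tlu_predict(x, w, b=0.0):
    """Weighted sum of the inputs (plus an optional bias), followed by a step."""
    z = np.dot(x, w) + b
    return heaviside(z)

# Hypothetical hand-picked parameters: the unit fires only when both inputs are 1.
w = np.array([1.0, 1.0])
b = -1.5
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

predictions = tlu_predict(X, w, b)
print(predictions)
```

The decision boundary is the line $x_1 + x_2 = 1.5$: instances on one side are assigned the positive class, the rest the negative class.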

A perceptron is composed of a single layer of TLUs, with each TLU connected to all the inputs. When all the neurons in a layer are connected to every neuron in the previous layer, the layer is called a *fully connected* or *dense* layer. *Input neurons* are simple pass-through units that output whatever they are fed, and together they form the *input layer*. A bias term is generally added, typically represented by a *bias neuron*, which outputs 1 all the time. (e.g. architecture pg 286 fig 10-5)

We can then write the outputs of a fully connected layer as 
$$ h_{\textbf{W, b}}(\textbf{X}) = \phi(\textbf{XW + b})$$
where
- $\textbf{X}$ is the matrix of input features (one row per instance, one column per feature)
- $\textbf{W}$ contains the connection weights, except the ones from the bias neuron (one row per input neuron, one column per artificial neuron in the layer)
- $\textbf{b}$ is the bias vector, containing the connection weights from the bias neuron (one bias term per artificial neuron)
- $\phi$ is called the *activation function* (when the neurons are TLUs, this is a step function)
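The layer equation maps directly onto a matrix product. The sketch below is one way to write it with NumPy; `dense_forward` and the example weights are hypothetical, chosen so that the first neuron behaves like OR and the second like AND.

```python
import numpy as np

def step(z):
    """Step activation: 1 where z >= 0, else 0."""
    return np.where(z >= 0, 1, 0)

def dense_forward(X, W, b, phi=step):
    """Compute phi(XW + b) for a batch X of shape (n_instances, n_inputs)."""
    return phi(X @ W + b)

X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])          # 4 instances, 2 features
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])          # 2 input neurons x 2 artificial neurons
b = np.array([-0.5, -1.5])          # one bias term per artificial neuron

outputs = dense_forward(X, W, b)    # shape (4, 2): one row per instance
print(outputs)
```

Note that `b` is broadcast across the rows of `X @ W`, so every instance gets the same bias terms.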

The perceptron learning rule reinforces connections between neurons that help reduce the error: the perceptron is fed one training instance at a time, and for each instance it makes its predictions. For every output neuron that produced a wrong prediction, it reinforces the connection weights from the inputs that would have contributed to the correct prediction

$$ w_{i,j}^{\text{next step}} = w_{i,j} +\eta(y_j - \hat{y_j})x_i$$

where 
- $w_{i,j}$ is the connection weight between the ith input neuron and the jth output neuron
- $x_i$ is the ith input value of the current training instance
- $\hat{y_j}$ is the output of the jth output neuron for the current training instance
- $y_j$ is the target output of the jth output neuron
- $\eta$ is the learning rate
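The update rule can be sketched as a short training loop. This is an illustrative implementation, not a reference one: `train_perceptron`, the zero initialisation, the learning rate of 0.5, and the choice to update the bias with the same rule (treating it as a weight on a constant input of 1) are assumptions. The toy data is the logical OR, which is linearly separable.

```python
import numpy as np

def step(z):
    return np.where(z >= 0, 1, 0)

def train_perceptron(X, y, eta=0.5, n_epochs=20):
    """Apply w_ij <- w_ij + eta * (y_j - y_hat_j) * x_i, one instance at a time."""
    n_inputs, n_outputs = X.shape[1], y.shape[1]
    W = np.zeros((n_inputs, n_outputs))
    b = np.zeros(n_outputs)
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            y_hat = step(x_i @ W + b)          # prediction for this instance
            error = y_i - y_hat                # zero for correct output neurons
            W += eta * np.outer(x_i, error)    # same rule for every (i, j) pair
            b += eta * error                   # bias neuron always outputs 1
    return W, b

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0], [1], [1], [1]])             # logical OR, one output neuron

W, b = train_perceptron(X, y)
predictions = step(X @ W + b)
print(predictions.ravel())
```

Weights only change when a prediction is wrong, so once every instance is classified correctly the parameters stop moving.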

# The multilayer perceptron and backpropagation

An MLP consists of one input layer, one or more layers of TLUs (called *hidden layers*), and one final layer of TLUs called the *output layer*. Every layer except the output layer includes a bias neuron and is fully connected to the next layer.

To train an MLP, we use [backpropagation](https://homl.info/44). In short, it is Gradient Descent combined with an efficient technique for computing gradients automatically: it computes the gradient of the network's error with regard to every single model parameter, and so finds out how much each connection weight and bias should be tweaked in order to reduce the error. Computing gradients automatically in this way is called *automatic differentiation*, or *autodiff*; appendix D has more info on it.

Here's how it works:
- It handles one mini-batch at a time (e.g. 32 instances) and goes through the training set multiple times; each pass is called an *epoch*
- Each mini-batch is passed to the network's input layer, which sends it to the first hidden layer. The algorithm then computes the outputs of this layer and passes them to the next layer, and so on, until we get the output of the output layer. This is called a *forward pass*, and the intermediate results are saved because they are needed for the backward pass
- Next it measures the network's output error (using some loss function)
- Then it computes how much each output connection contributed to the error (done using the chain rule)
- The algorithm then measures how much of these error contributions came from each connection in the layer below, again using the chain rule, and works backward until it reaches the input layer
- Finally, it performs a gradient descent step to tweak all the connection weights in the network
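The steps above can be sketched in NumPy for a network with one hidden layer. Everything specific here is an assumption made for illustration: the function names, a logistic hidden layer with a linear output, MSE as the loss, the initialisation scale, the learning rate, and the synthetic data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse(y_hat, y):
    return float(np.mean((y_hat - y) ** 2))

def forward(params, X):
    """Forward pass; the hidden activations are returned so they can be reused."""
    W1, b1, W2, b2 = params
    A1 = sigmoid(X @ W1 + b1)     # hidden layer output (saved)
    y_hat = A1 @ W2 + b2          # linear output layer
    return A1, y_hat

def backward(params, X, y, A1, y_hat):
    """Gradient of the MSE with respect to each parameter, via the chain rule."""
    W1, b1, W2, b2 = params
    d_out = 2.0 * (y_hat - y) / y.size             # error signal at the output
    dW2 = A1.T @ d_out                             # output connections
    db2 = d_out.sum(axis=0)
    d_hidden = (d_out @ W2.T) * A1 * (1.0 - A1)    # pushed back through sigmoid
    dW1 = X.T @ d_hidden                           # hidden connections
    db1 = d_hidden.sum(axis=0)
    return [dW1, db1, dW2, db2]

def train(X, y, n_hidden=8, eta=0.1, n_epochs=200, batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    params = [rng.normal(0.0, 0.5, (X.shape[1], n_hidden)), np.zeros(n_hidden),
              rng.normal(0.0, 0.5, (n_hidden, y.shape[1])), np.zeros(y.shape[1])]
    losses = [mse(forward(params, X)[1], y)]       # loss before any training
    for _ in range(n_epochs):                      # one epoch = one full pass
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size): # one mini-batch at a time
            idx = order[start:start + batch_size]
            A1, y_hat = forward(params, X[idx])
            grads = backward(params, X[idx], y[idx], A1, y_hat)
            for p, g in zip(params, grads):
                p -= eta * g                       # in-place gradient descent step
        losses.append(mse(forward(params, X)[1], y))
    return params, losses

# Synthetic regression data (hypothetical): 128 instances, 2 features.
data_rng = np.random.default_rng(42)
X = data_rng.uniform(-1.0, 1.0, size=(128, 2))
y = np.sin(X[:, :1]) + 0.5 * X[:, 1:]

params, losses = train(X, y)
print(f"loss before training: {losses[0]:.4f}, after: {losses[-1]:.4f}")
```

In practice a framework's autodiff would derive `backward` automatically; writing it by hand here only serves to make the chain-rule steps visible.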

One key change to the original MLP architecture was replacing the step function with the logistic function $\sigma(z) = 1 / (1 +\exp(-z))$. The step function is flat everywhere except at $z=0$, so it gives Gradient Descent no gradient to work with, whereas the logistic function is smooth and has a nonzero derivative everywhere.

Some other choices of activation function are:
- Hyperbolic tangent $\tanh(z) = 2\sigma(2z) - 1$

Another S-shaped function, continuous and differentiable. Its outputs are in the range -1 to 1 (instead of 0 to 1 for the logistic function), making each layer's output more or less centered around 0 at the beginning of training, which helps speed up convergence.

- Rectified Linear Unit $\text{ReLU}(z) = \max(0,z)$

Continuous but not differentiable at $z=0$, and its derivative is 0 for $z<0$. However, it works very well in practice, is fast to compute, and has become the default.
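For reference, here is a small sketch of the logistic, tanh, and ReLU functions together with the derivatives that backpropagation needs. The function names are hypothetical, and the ReLU derivative at exactly $z=0$ is set to 0 by convention (an assumption, since the true derivative is undefined there).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Written via the logistic function, as in the identity above.
    return 2.0 * sigmoid(2.0 * z) - 1.0

def relu(z):
    return np.maximum(0.0, z)

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - tanh(z) ** 2

def relu_grad(z):
    # 1 for z > 0, else 0 (value at z == 0 chosen by convention).
    return np.where(z > 0, 1.0, 0.0)

z = np.linspace(-3.0, 3.0, 7)
print(np.allclose(tanh(z), np.tanh(z)))   # the identity agrees with NumPy's tanh
print(relu(z))
```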

Activation functions are useful because they add non-linearity to each layer. Recall that a composition of linear transformations is itself linear, so without them a stack of layers would be equivalent to a single layer. Using a non-linear function allows an MLP to learn more complex patterns.
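A quick numerical check of this claim, using randomly drawn weights (all values here are arbitrary and only for illustration): two stacked layers without an activation collapse into one equivalent layer, while inserting a ReLU between them breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)

# Two layers with no activation function in between.
two_layers = (X @ W1 + b1) @ W2 + b2

# The same mapping expressed as a single layer.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = X @ W + b
collapses = bool(np.allclose(two_layers, one_layer))

# With a ReLU between the layers, the single layer no longer matches.
with_relu = np.maximum(0.0, X @ W1 + b1) @ W2 + b2
differs = not np.allclose(with_relu, one_layer)

print(collapses, differs)
```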

## Regression MLPs

To use an MLP for regression, we use one output neuron for each value we want to predict. In the univariate case (e.g. predicting a house price) only a single output neuron is needed.

For multivariate problems, you need one output neuron per output dimension. For example to locate the center of an object in an image, you need to predict 2D coordinates, thus 2 output neurons. If you also want to place a bounding box around the object, you need two more numbers, the width and height of the object. In total, 4 output neurons.

In general we do not use any activation function for the output neurons, so they are free to output any range of values. To guarantee that the output is always positive, use ReLU or *softplus*, which is a smooth variant of ReLU: $\text{softplus}(z) = \log(1 + \exp(z))$.
Finally, if we want to guarantee that the predictions fall within a given range of values, we can use the logistic or hyperbolic tangent function, scaling the labels to the appropriate range (0 to 1 for the logistic function, -1 to 1 for tanh).
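As a rough sketch of these two ideas: softplus can be computed stably with `np.logaddexp`, and labels in a known interval can be rescaled to the logistic function's output range and back. The helper names and the interval `[10, 50]` are hypothetical.

```python
import numpy as np

def softplus(z):
    # log(1 + exp(z)) written as log(exp(0) + exp(z)) to avoid overflow.
    return np.logaddexp(0.0, z)

def scale_labels(y, low, high):
    """Map labels from [low, high] to [0, 1], the range of the logistic function."""
    return (y - low) / (high - low)

def unscale_predictions(p, low, high):
    """Map network outputs in [0, 1] back to the original label range."""
    return low + p * (high - low)

y = np.array([10.0, 30.0, 50.0])
y_scaled = scale_labels(y, 10.0, 50.0)
y_back = unscale_predictions(y_scaled, 10.0, 50.0)

print(softplus(np.array([-5.0, 0.0, 5.0])))
print(y_scaled, y_back)
```

For a tanh output layer the same idea applies with a target range of -1 to 1 instead.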

The typical loss function is the mean squared error (MSE); however, if you have a lot of outliers in your training set you may want to use the mean absolute error (MAE) instead. Alternatively use [Huber loss](https://en.wikipedia.org/wiki/Huber_loss), which is a combination of both.
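To see how the three losses react to an outlier, here is a minimal sketch. The function names, the Huber threshold `delta=1.0`, and the toy values are assumptions; the Huber form used is quadratic for errors up to `delta` and linear beyond it.

```python
import numpy as np

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

def mae(y, y_hat):
    return float(np.mean(np.abs(y - y_hat)))

def huber(y, y_hat, delta=1.0):
    error = np.abs(y - y_hat)
    quadratic = 0.5 * error ** 2               # used where error <= delta
    linear = delta * (error - 0.5 * delta)     # used where error > delta
    return float(np.mean(np.where(error <= delta, quadratic, linear)))

# Three perfect predictions and one large outlier (hypothetical values).
y = np.array([1.0, 2.0, 3.0, 100.0])
y_hat = np.array([1.0, 2.0, 3.0, 4.0])

print(mse(y, y_hat), mae(y, y_hat), huber(y, y_hat))
```

The single outlier dominates the MSE because its error is squared, whereas MAE and Huber grow only linearly with it.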


Typical regression MLP architecture:

| Hyperparameter | Typical value |
| --- | --- |
| # input neurons | One per input feature (e.g. 28x28=784 for MNIST) |
| # hidden layers | Variable (typically 1 to 5) |
| # neurons per hidden layer | Variable (typically 10 to 100) |
| # output neurons | 1 per prediction dimension |
| Output activation | None (any range of values), ReLU/softplus (positive outputs), or logistic/tanh (bounded outputs) |
| Loss function | MSE, or MAE/Huber (if outliers) |

## Classification MLPs