# Adaptive Linear Neuron
#### ADAptive LInear Neuron (Adaline)

Just a few years after Frank Rosenblatt published his findings on the perceptron model, Bernard Widrow published an improvement to it. The Adaline model learns its weights by minimizing a continuous cost function, which allows for more precise weight updates and therefore more accurate and deliberate learning.

The major difference under the hood is the manner in which the weights are updated. In the Perceptron model, the weights are updated based on the output of a unit step function, where a threshold determines activation. In the Adaline model, the weights are updated based on a linear activation function $\phi(z)$ rather than the step function.

The linear activation function $\phi(z)$, which is used to learn the weights, is simply the identity function of the net input, so that:
$$\phi(w^Tx) = w^Tx$$
Thus the linear activation is the sum of the products of the inputs and their respective weights.
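
As a minimal sketch of the net input and the identity activation, the snippet below uses NumPy. The weight and input values, and the function names `net_input` and `activation`, are made up for illustration. Any bias term is left out to match the formula above.

```python
import numpy as np

# Hypothetical weights and a single input sample, chosen only for illustration.
w = np.array([0.5, -0.2, 0.1])
x = np.array([1.0, 2.0, 3.0])


def net_input(w, x):
    """Net input z = w^T x: the sum of each input times its weight."""
    return np.dot(w, x)


def activation(z):
    """Linear (identity) activation: phi(z) = z."""
    return z


z = net_input(w, x)
print(activation(z))  # the same continuous value as z
```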

In the Adaline model we still use a threshold function to determine the final prediction. However, the true class labels are compared with the continuous output of the activation function to compute the error and update the weights. This is in contrast to the Perceptron model, where the class predictions from the model are compared to the true class labels.

Notice that the Perceptron model only updates the weights when the predicted class label and the true class label don't match. For example, with the error defined as the true label minus the prediction, a predicted class label of 1 and a true class label of 0 gives an error of -1; likewise, a predicted class label of 0 and a true class label of 1 gives an error of 1. With the Adaline model, a continuous activation output of 0.75 compared to a true class label of 1 gives an error of 0.25. Because the size of the update reflects how far off the output is, this makes for a more accurate and deliberate learning paradigm.
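
The difference between the two error signals can be sketched in a few lines of Python. The threshold value of 0.5 and the function name `threshold` are assumptions made for this example, not something fixed by the models themselves.

```python
def threshold(z, theta=0.5):
    """Unit step: predict class 1 when z reaches theta (assumed 0.5), else class 0."""
    return 1 if z >= theta else 0


y_true = 1
z = 0.75  # continuous output of the linear activation, phi(z) = z

# Perceptron: the error compares the true label with the predicted class label.
perceptron_error = y_true - threshold(z)

# Adaline: the error compares the true label with the continuous output.
adaline_error = y_true - z

print(perceptron_error, adaline_error)  # 0 0.25
```

Under these assumptions the Perceptron sees a correct prediction and makes no update, while Adaline still receives an error of 0.25 and keeps adjusting its weights.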

From the diagram below, it is important to point out that in the Adaline model the error is calculated and the weight update is applied before the threshold function determines the output. In the Perceptron model the update to the weights depends on whether or not the cell was turned on when it should have been; in the Adaline model the update to the weights depends on the output of the activation function, before the cell is turned on or stays off.

<img src="./images/perceptron-adaline-flow.png" alt="perceptron-adaline-flow" style="width: 75%;"/>


## Minimizing cost functions with Gradient Descent

In supervised learning there is an objective function to be optimized; most often it is a cost function to be minimized. For the Adaline model we define the cost function $J$ used to learn the weights as the **Sum of Squared Errors** (**SSE**) between the calculated outcome and the true class label. The sum of squared errors is simply a measure of the discrepancy between the data and the estimate: *"how far off"* is our prediction? We define the cost function $J(w)$ as follows:

$$J(w) = \frac{1}{2}\sum_{i} \left(y^{(i)} - \phi\left(z^{(i)}\right)\right)^2.$$

The coefficient $\frac{1}{2}$ is added for convenience: it cancels the factor of 2 that appears when the squared term is differentiated, which simplifies the gradient.
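
The cost function can be sketched in NumPy as follows. The function name `sse_cost` and the sample labels and outputs are hypothetical values chosen so the arithmetic is easy to check by hand.

```python
import numpy as np


def sse_cost(y, output):
    """Sum of squared errors with the 1/2 factor: J = 0.5 * sum((y - output)^2)."""
    errors = y - output
    return 0.5 * np.sum(errors ** 2)


# Hypothetical true labels and continuous activation outputs.
y = np.array([1.0, 0.0, 1.0])
output = np.array([0.75, 0.25, 1.0])

# Errors are 0.25, -0.25 and 0, so J = 0.5 * (0.0625 + 0.0625 + 0) = 0.0625.
print(sse_cost(y, output))  # 0.0625
```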

Unlike the step function, the linear activation function is continuous, so the cost function can be differentiated. This property, together with the convex shape of the cost function, allows us to use an algorithm called **gradient descent** to find the weights that minimize the cost function.

Gradient descent can be illustrated as climbing down a hill: at each iteration you take a step down the curve, and the size of the step is determined by the slope and the learning rate.

<img src="./images/gradient-descent.png" alt="gradient-descent" style="width: 75%;"/>




For every step, the weights are updated by the product of the learning rate $\eta$ and the negative gradient $-\nabla J(w)$ of the cost function $J(w)$. This can be described as,

$$ w := w + \Delta w,$$

where $\Delta w$ is,

$$ \Delta w = -\eta\nabla J(w). $$
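
To make the update rule concrete, here is a minimal sketch on a one-dimensional toy problem. The cost $J(w) = \frac{1}{2}(w - 3)^2$, the learning rate of 0.5, and the function name `gradient_step` are assumptions chosen only for illustration; they are not the Adaline cost itself.

```python
def gradient_step(w, grad, eta):
    """Apply w := w + delta_w, where delta_w = -eta * grad."""
    return w - eta * grad


# Hypothetical convex cost J(w) = 0.5 * (w - 3)^2, whose gradient is (w - 3).
w = 0.0
for _ in range(25):
    w = gradient_step(w, grad=w - 3.0, eta=0.5)

print(w)  # approaches the minimum at w = 3
```

With this learning rate the distance to the minimum halves at every step. A much larger learning rate would overshoot the minimum instead of settling into it.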

To compute the gradient of the cost function we must get the partial derivative of the cost function with respect to each weight $w_j$,

$$ \frac{\partial J}{\partial w_j} = -\sum_i \left( y^{(i)} - \phi\left( z^{(i)} \right) \right)x^{(i)}_j .$$
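
For readers who want the intermediate steps, the result above follows from the chain rule, using $\phi\left(z^{(i)}\right) = z^{(i)} = \sum_k w_k x^{(i)}_k$ (the summation index $k$ is introduced here only for the derivation):

$$
\begin{aligned}
\frac{\partial J}{\partial w_j}
&= \frac{\partial}{\partial w_j} \frac{1}{2}\sum_i \left( y^{(i)} - \phi\left( z^{(i)} \right) \right)^2 \\
&= \frac{1}{2}\sum_i 2\left( y^{(i)} - \phi\left( z^{(i)} \right) \right) \frac{\partial}{\partial w_j}\left( y^{(i)} - \phi\left( z^{(i)} \right) \right) \\
&= \sum_i \left( y^{(i)} - \phi\left( z^{(i)} \right) \right) \frac{\partial}{\partial w_j}\left( y^{(i)} - \sum_k w_k x^{(i)}_k \right) \\
&= \sum_i \left( y^{(i)} - \phi\left( z^{(i)} \right) \right) \left( -x^{(i)}_j \right) \\
&= -\sum_i \left( y^{(i)} - \phi\left( z^{(i)} \right) \right) x^{(i)}_j .
\end{aligned}
$$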

The update to each weight $w_j$ can now be written as:
$$ \Delta w_j = -\eta\frac{\partial J}{\partial w_j} = \eta\sum_i \left( y^{(i)} - \phi\left( z^{(i)} \right) \right)x^{(i)}_j .$$

This form of gradient descent is known as **batch gradient descent** because the weights are not updated after each training sample; instead, each update to the weights is calculated from all samples in the training set.
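
Putting the pieces together, batch gradient descent for Adaline might be sketched as below. This is an illustrative sketch under stated assumptions, not a reference implementation. The function name `fit_adaline`, the toy data, the learning rate, the number of iterations, the column of ones used as a bias input, and the 0.5 prediction threshold for 0/1 labels are all choices made for this example.

```python
import numpy as np


def fit_adaline(X, y, eta=0.01, n_iter=50):
    """Batch gradient descent for Adaline; returns the weights and per-epoch costs."""
    w = np.zeros(X.shape[1])
    costs = []
    for _ in range(n_iter):
        output = X.dot(w)           # phi(z) = z = w^T x for every sample
        errors = y - output         # true label minus continuous output
        costs.append(0.5 * np.sum(errors ** 2))  # SSE before this update
        w += eta * X.T.dot(errors)  # delta_w_j = eta * sum_i error_i * x_j^(i)
    return w, costs


# Hypothetical toy data: the first column of ones acts as a bias input.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w, costs = fit_adaline(X, y, eta=0.05, n_iter=100)

# Final class prediction: threshold the continuous output (assumed threshold 0.5).
predictions = np.where(X.dot(w) >= 0.5, 1, 0)
print(predictions)
print(costs[0], costs[-1])
```

Note that `X.T.dot(errors)` sums over all samples in a single operation, which is what makes this a batch update: the weights change once per pass over the data rather than once per sample.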