# The ADALINE 

Adaline (short for Adaptive Linear Neuron, later also read as Adaptive Linear Element) was proposed in 1959, shortly after Rosenblatt’s perceptron, by **Bernard Widrow** and **Ted Hoff** (one of the inventors of the microprocessor) at Stanford. Widrow and Hoff were electrical engineers, yet Widrow had attended the famous *Dartmouth Summer Research Project on Artificial Intelligence* in 1956, an experience that got him interested in building brain-like artificial learning systems. When Widrow moved from MIT to Stanford, a colleague asked him whether he would be interested in taking Ted Hoff as his doctoral student. Widrow and Hoff came up with the Adaline algorithm on a Friday during their first session working together. At that time, implementing an algorithm on a mainframe computer was slow and expensive, so they decided to build a small device that could be trained with the Adaline algorithm to classify patterns of inputs.

The main difference between the perceptron and Adaline is that Adaline works by minimizing the **sum of squares of the linear errors** over a training set. This means that the learning rule is based on a **linear activation function** rather than a unit step function as in the perceptron. This matters because it allows the minimization of a continuous cost function. Continuous cost functions have the advantage of being differentiable, which allows training neural nets with gradient-based methods and the chain rule of calculus. This opened the door to training more complex models such as non-linear multilayer perceptrons, and the same idea underlies the training of models like logistic regression and support vector machines.

## Formal definition of ADALINE

As we mentioned before, Adaline uses a **linear activation function**, which is essentially the **identity function** applied to the net input of the network. This is defined as:

$$\hat{y} = \sum_{j=1}^n x_j w_j + \theta$$


where:  
- $\hat{y}$ is the output of the model (real-valued scalar)  
- $x$ is a real-valued input vector with $n$ features  
- $w$ is a real-valued weight vector  
- $\theta$ is a bias term  

It is crucial to notice that even when we use a linear activation function to compute the output of the network, if we attempt to make a **binary classification decision**, we will still use a step-like decision function by taking the sign as: 

$$\operatorname{sgn}(\hat{y}) =
\begin{cases}
+1, & \text{if } \hat{y} > 0 \\
-1, & \text{otherwise}
\end{cases}
$$
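
As a small worked example of the two formulas above, the sketch below computes the linear output $\hat{y}$ for a single input and then applies the sign decision. The input, weight, and bias values are made up for illustration only.

```python
import numpy as np

# Illustrative values only (not taken from any dataset)
x = np.array([1.0, 2.0, -1.0])    # input vector
w = np.array([0.5, -0.25, 0.1])   # weight vector
theta = 0.2                       # bias term

# Linear activation: y_hat = sum_j x_j * w_j + theta
y_hat = np.dot(x, w) + theta

# Step-like decision taken on top of the linear output
label = 1 if y_hat > 0 else -1

print(y_hat, label)
```

Note that the continuous value `y_hat` is what the learning rule works with, while `label` is only used when a class decision is needed.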


## A note on the Perceptron and Adaline fundamental difference

At this point you may be wondering what the difference is between the perceptron and Adaline, considering that both end up using a step function to make classifications. The difference is the **learning rule used to update the weights** of the network. The perceptron updates the weights by computing the difference between the expected and predicted **class labels**. In practice, this means that the perceptron's error term can only take a few discrete values (for example -1, 0, and 1 when the labels are coded as 0 and 1). Adaline, on the other hand, computes the difference between the expected class label (i.e., -1 or 1) and the **continuous real value** of the linear activation function.
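
The contrast can be made concrete with a few made-up numbers. In the sketch below the labels are coded as -1 and +1, so the perceptron-style error can only be -2, 0, or 2, while the Adaline-style error can be any real number. The variable names and values are illustrative assumptions.

```python
import numpy as np

# Illustrative values only
y = np.array([1, -1, 1])             # expected class labels
net = np.array([0.3, -2.0, -0.4])    # linear outputs (net input)

# Perceptron-style error: labels versus thresholded predictions
predicted_labels = np.where(net > 0, 1, -1)
perceptron_errors = y - predicted_labels   # only -2, 0, or 2 are possible here

# Adaline-style error: labels versus the continuous linear output
adaline_errors = y - net                   # any real value is possible

print(perceptron_errors)
print(adaline_errors)
```

The first two examples are classified correctly, so the perceptron sees no error there, whereas Adaline still registers how far each linear output is from its target.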


## Learning rule

As we mentioned before, the Adaline learning rule consists of comparing the expected class label to the predicted continuous real-valued output. To achieve this, Adaline uses the **LMS (least mean squares) algorithm**, also known as the **Widrow-Hoff delta rule**, which minimizes the sum of squares of the linear errors over the training set. In the machine learning literature, this quantity is known as a **cost function** to be minimized. This is defined as:

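$$L = \frac{1}{2}\sum_{i}(y_i - \hat{y}_i)^2$$

where: 
- $\hat{y}_i$ is the output of the model for the $i$-th training example (real-valued scalar)  
- $y_i$ is the expected class label for that example (-1 or +1)

The factor $\frac{1}{2}$ is a convenience that cancels the 2 produced when differentiating the square; it does not change where the minimum lies.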
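
To make the cost tangible, here is a minimal sketch that computes the sum of squared errors for three made-up examples, together with the halved version that the implementation later in this section stores as the cost. All values are illustrative assumptions.

```python
import numpy as np

# Illustrative values only
y = np.array([1.0, -1.0, 1.0])       # expected class labels
y_hat = np.array([0.5, -1.5, 1.0])   # continuous model outputs

errors = y - y_hat                   # one linear error per example
sse = np.sum(errors ** 2)            # sum of squared errors
half_sse = sse / 2.0                 # halved version used as the cost

print(sse, half_sse)
```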

Now the question is how to minimize the sum of squared errors (SSE). We do this by adjusting the values of the weight vector $w$. Since the cost is a continuous, differentiable function of the weights, we can compute how the SSE changes with respect to changes in $w$ and apply the **gradient descent algorithm**. Therefore, we update the values of $w$ by:

$$w^{(t+1)} \leftarrow w^{(t)} + \eta\left(-\nabla L(w^{(t)})\right)$$

where:
- $w^{(t)}$ is the weight vector at iteration $t$
- $\eta$ is the learning rate (positive constant)
- $\nabla L(w^{(t)})$ is the gradient of the cost at the current point on the SSE surface 

This algorithm works by taking steps, of a size controlled by the learning rate $\eta$, on the cost surface defined over the vector of weights. A common way to express this idea is by analogy with hiking: if you are on a mountain, you can ascend by **walking up-hill** or descend by **walking down-hill**. Since the surface defined by this quadratic function is convex (think of a bowl) and has a global minimum with no other local minima, **we want to go down-hill** (i.e., we do *gradient descent*) toward the point where the SSE is minimized.  
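
The down-hill idea can be sketched on a one-dimensional bowl before applying it to Adaline. The toy function $f(w) = (w - 3)^2$, the helper name `gradient_descent_1d`, and the chosen learning rates are assumptions made purely for illustration.

```python
def gradient_descent_1d(w0, eta, n_steps):
    """Minimize the toy function f(w) = (w - 3)^2 starting from w0."""
    w = w0
    path = [w]
    for _ in range(n_steps):
        grad = 2.0 * (w - 3.0)   # derivative of (w - 3)^2
        w = w - eta * grad       # step against the gradient (down-hill)
        path.append(w)
    return path

# A small learning rate walks steadily toward the minimum at w = 3
path = gradient_descent_1d(w0=0.0, eta=0.1, n_steps=50)
print(path[0], path[-1])

# A learning rate that is too large overshoots and moves away from it
diverging = gradient_descent_1d(w0=0.0, eta=1.1, n_steps=50)
print(diverging[-1])
```

On this toy function each step multiplies the distance to the minimum by $1 - 2\eta$, which is why a small $\eta$ converges while $\eta > 1$ diverges. The same trade-off applies, with different thresholds, to the Adaline cost surface.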

To obtain the gradient at a given point on the convex surface, we compute the partial derivative of the cost function $L$ with respect to each weight in the weight vector as:

$$\frac{\partial L}{\partial w_j} = -\sum_{i}(y_i - \hat{y}_i)x_{ji}$$
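
For completeness, the intermediate steps can be written out with the chain rule. The sketch below assumes the cost carries a factor of $\frac{1}{2}$ (as the implementation later in this section does when it divides the sum of squared errors by 2), so the 2 produced by the square cancels; without that factor the gradient would simply be twice as large. The index $k$ is introduced here only to keep the inner sum separate from $j$.

$$
\begin{aligned}
\frac{\partial L}{\partial w_j}
&= \frac{\partial}{\partial w_j}\,\frac{1}{2}\sum_{i}(y_i - \hat{y}_i)^2 \\
&= \sum_{i}(y_i - \hat{y}_i)\,\frac{\partial}{\partial w_j}(y_i - \hat{y}_i) \\
&= \sum_{i}(y_i - \hat{y}_i)\,\frac{\partial}{\partial w_j}\Big(y_i - \sum_{k} x_{ki} w_k - \theta\Big) \\
&= -\sum_{i}(y_i - \hat{y}_i)\,x_{ji}
\end{aligned}
$$

The same steps applied to the bias give $\frac{\partial L}{\partial \theta} = -\sum_{i}(y_i - \hat{y}_i)$, since the bias enters the output with a constant factor of 1.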

Finally, by replacing terms, the update rule can be written as:

$$\Delta w_j = -\eta \frac{\partial L}{\partial w_j} = \eta\sum_{i}(y_i - \hat{y}_i)x_{ji}$$
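
One way to gain confidence in the gradient formula is to compare it against a finite-difference approximation. The sketch below does this on made-up data, assuming the cost is half the sum of squared errors; the helper names `half_sse` and `analytic_gradient` are hypothetical and not part of the implementation that follows.

```python
import numpy as np

def half_sse(X, y, w, theta):
    """Cost L = 1/2 * sum_i (y_i - y_hat_i)^2 with a linear output."""
    y_hat = X @ w + theta
    return 0.5 * np.sum((y - y_hat) ** 2)

def analytic_gradient(X, y, w, theta):
    """dL/dw_j = -sum_i (y_i - y_hat_i) * x_ji, computed for all j at once."""
    y_hat = X @ w + theta
    return -X.T @ (y - y_hat)

# Made-up data: 4 examples, 2 features
X = np.array([[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 1.0]])
y = np.array([1.0, -1.0, -1.0, 1.0])
w = np.array([0.1, -0.2])
theta = 0.05

# Central finite differences as an independent check
eps = 1e-6
numeric = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = eps
    numeric[j] = (half_sse(X, y, w + step, theta)
                  - half_sse(X, y, w - step, theta)) / (2 * eps)

analytic = analytic_gradient(X, y, w, theta)
print(analytic)
print(numeric)
```

The two printed vectors should agree up to rounding error. The weight update is then just `-eta * analytic`, which matches the rule above.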



## Adaline algorithm implementation

We will implement the Adaline algorithm from scratch with Python and NumPy (a Python package for scientific computing). The goal is to understand the step-by-step execution of Adaline rather than to achieve an elegant implementation. I'll break down each step into functions and assemble everything at the end. 


### Generate vector of random weights

```python
import numpy as np

def random_weights(X, random_state: int):
    '''Create a vector of small random weights.

    Parameters
    ----------
    X: 2-dimensional array, shape = [n_samples, n_features]
    random_state: int, seed for the random number generator

    Returns
    -------
    w: array, shape = [1 + n_features], bias weight first
    '''
    rand = np.random.RandomState(random_state)
    w = rand.normal(loc=0.0, scale=0.01, size=1 + X.shape[1])
    return w
```

Predictions from Adaline are obtained by a linear combination of features and weights. It is common practice to begin with a vector of small random weights that would be updated later by the Adaline learning rule.

### Compute net input

```python
def net_input(X, w):
    '''Compute net input as dot product plus bias (w[0])'''
    return np.dot(X, w[1:]) + w[0]
```

Here we pass the feature matrix and the previously generated vector of random weights to compute the inner product. Remember that the weight vector carries an extra weight for the bias term at the beginning (`w[0]`), which is added to the inner product of the features and the remaining weights (`w[1:]`).

### Compute activation

```python
def activation(X):
    '''Compute linear activation (identity function)'''
    return X
```

Note that the activation function returns the same values passed in. As we mentioned earlier, the linear activation function of Adaline is the **identity function**, which means exactly this: units will be activated in direct proportion to the output of the linear combination of features and weights. Technically, we could skip this function and the result would be the same. Yet we keep it for conceptual completeness. 

### Compute predictions

```python
def predict(X, w):
    '''Return class label after unit step'''
    return np.where(net_input(X, w) >= 0.0, 1, -1)
```

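Remember that although the Adaline learning rule works by comparing the output of a linear function against the class labels, when making predictions we still need to pass the output through a *sgn function* to get class labels, as in the perceptron. Note that `predict` maps a net input of exactly 0 to +1, whereas the definition given earlier maps it to -1; the two differ only on the decision boundary itself.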

### Training loop - Learning rule

```python
def fit(X, y, eta=0.001, n_iter=100):
    '''Loop over the dataset n_iter times and update the weights'''
    costs = []
    w = random_weights(X, random_state=1)
    for i in range(n_iter):
        # stored as `net` so the local name does not shadow net_input()
        net = net_input(X, w)
        output = activation(net)  # identity function
        errors = (y - output)  # compute errors for the entire dataset
        w[1:] += eta * X.T @ errors  # update feature weights for the entire dataset
        w[0] += eta * errors.sum()  # update bias weight for the entire dataset
        cost = (errors**2).sum() / 2.0  # half the sum of squared errors
        costs.append(cost)

    return w, costs
```

Let's examine the `fit` function that implements the Adaline learning rule:

* Create a vector of random weights by using the `random_weights` function, with one weight per column in the feature matrix plus one for the bias
* Loop over the entire dataset `n_iter` times with `for i in range(n_iter)`
* Compute the inner product between the feature matrix $X$ and the weight vector $w$ by using the `net_input(X, w)` function
* Pass the net input through the identity `activation` function to obtain the output
* Compute the difference between the target values and the predicted values for the entire dataset `(y - output)`
* Update the weights in proportion to the learning rate $\eta$ by `w[1:] += eta * X.T @ errors` and `w[0] += eta * errors.sum()`
* Compute half the SSE with `cost = (errors**2).sum() / 2.0`
* Save the cost for each iteration with `costs.append(cost)`

## Testing the Adaline

In [10]:
#### TODO ####
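
The test cell above is still a placeholder. As a stand-in, here is a minimal, self-contained sketch of how the learning rule could be tried out. It assumes a synthetic dataset of two well-separated Gaussian blobs, which is not necessarily the dataset the author intended, and it uses a hypothetical helper `train_adaline` that mirrors the `fit` function above so the block can run on its own.

```python
import numpy as np

def train_adaline(X, y, eta=0.001, n_iter=100, seed=1):
    """Batch gradient descent on half the sum of squared errors.

    Hypothetical stand-in that mirrors the fit function above;
    returns the weights (bias first) and the cost per iteration.
    """
    rng = np.random.RandomState(seed)
    w = rng.normal(loc=0.0, scale=0.01, size=1 + X.shape[1])
    costs = []
    for _ in range(n_iter):
        output = X @ w[1:] + w[0]      # linear activation
        errors = y - output
        w[1:] += eta * X.T @ errors    # feature weights
        w[0] += eta * errors.sum()     # bias weight
        costs.append((errors ** 2).sum() / 2.0)
    return w, costs

# Synthetic, linearly separable data: two Gaussian blobs (an assumption)
rng = np.random.RandomState(0)
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(50), -np.ones(50)])

w, costs = train_adaline(X, y, eta=0.001, n_iter=100)

# Class labels come from thresholding the linear output
y_pred = np.where(X @ w[1:] + w[0] >= 0.0, 1, -1)
accuracy = np.mean(y_pred == y)

print(f"first cost: {costs[0]:.2f}, last cost: {costs[-1]:.2f}")
print(f"training accuracy: {accuracy:.2f}")
```

On data like this the cost should fall over the iterations. With a noticeably larger `eta` the updates can overshoot and the cost can grow instead, so the learning rate usually needs some tuning for each dataset.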

## The Linear Separability Constraint

In [11]:
#### TODO ####
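
This cell is also a placeholder. One common way to illustrate the constraint, offered here only as an assumption about what the section is meant to show, is the XOR problem: its four points cannot be split by a single straight line, so a linear model like Adaline cannot classify all of them correctly. The sketch below is self-contained and uses illustrative values for `eta` and `n_iter`.

```python
import numpy as np

# XOR truth table with labels coded as -1 / +1
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0, -1.0])

# Same batch update as the learning rule above (bias stored in w[0]);
# eta and n_iter are illustrative choices for this tiny dataset
rng = np.random.RandomState(1)
w = rng.normal(loc=0.0, scale=0.01, size=1 + X.shape[1])
eta, n_iter = 0.1, 500
costs = []
for _ in range(n_iter):
    errors = y - (X @ w[1:] + w[0])
    w[1:] += eta * X.T @ errors
    w[0] += eta * errors.sum()
    costs.append((errors ** 2).sum() / 2.0)

# Threshold the linear output to obtain class labels
y_pred = np.where(X @ w[1:] + w[0] >= 0.0, 1, -1)
accuracy = np.mean(y_pred == y)

print("weights:", w)
print("final cost:", costs[-1])
print("accuracy:", accuracy)
```

For this dataset the least-squares solution sets every weight to zero, so the cost settles near 2.0 (half of four squared errors of size 1) no matter how long training runs. No linear boundary can get more than three of the four XOR points right, which is the sense in which Adaline is limited to linearly separable problems.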

## References and Further Reading

- Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits (Technical Report No. 1553-1). Stanford University, Stanford Electronics Laboratories.

- Widrow, B., & Lehr, M. A. (1990). 30 years of adaptive neural networks: perceptron, madaline, and backpropagation. Proceedings of the IEEE, 78(9), 1415-1442.

- Widrow, B., & Lehr, M. A. (1995). Perceptrons, Adalines, and backpropagation. The handbook of brain theory and neural networks, 719-724.

**For code implementation:** 

- Raschka, S. (2015). Python machine learning. Packt Publishing Ltd.