# PERCEPTRON RULE

<br>

## Introduction

<br>
Although we may be more interested in neural networks of many interconnected units, let's begin by understanding how to update the weights of a single node. In this simple case, the learning problem is to determine a weight vector that causes the neuron to produce the correct output for each of the given training examples.

<br>
Several algorithms are known to solve this learning problem; in this notebook we consider two of them: <b>the perceptron rule and the delta rule</b>. These two algorithms are guaranteed to converge to somewhat different acceptable hypotheses, under somewhat different conditions, but both are important to artificial neural networks (ANNs) because they <b>provide the basis for learning networks of many units</b>.


## The Perceptron Rule

<br>
The perceptron algorithm is about <b>learning the weights for the input signals in order to draw a linear decision boundary</b> that allows us to discriminate between two linearly separable classes. Rosenblatt’s initial perceptron rule is fairly simple and can be summarized by the following steps:

<br>
<ul style="list-style-type:square">
    <li>
        initialize the perceptron weights to zero or small random numbers
    </li>
    <br>
    <li>
        <b>for each training sample</b>, calculate the perceptron output value and <b>update the weights</b> whenever it
        misclassifies an instance
    </li>
    <br>
    <li>
        the previous step is then <b>repeated</b>, each time iterating through all training examples, as many times as needed
        <b>until the perceptron classifies all training examples correctly</b>
    </li>
</ul>
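
<br>
The three steps above can be sketched in plain Python. This is only an illustrative sketch, not a reference implementation: the names `predict` and `train_perceptron`, the `max_epochs` safety cap, the convention that an activation of exactly zero maps to $+1$, and the logical-AND toy dataset are all assumptions made for the example.

```python
def predict(weights, x):
    """Threshold unit: +1 if w0 + w . x >= 0, else -1 (weights[0] is the bias)."""
    activation = weights[0] + sum(w * xj for w, xj in zip(weights[1:], x))
    return 1 if activation >= 0.0 else -1


def train_perceptron(samples, targets, eta=0.1, max_epochs=100):
    """Return (weights, epochs_used) after applying the perceptron rule."""
    # Step 1: initialize the weights (here: all zeros, bias included).
    weights = [0.0] * (len(samples[0]) + 1)
    for epoch in range(1, max_epochs + 1):
        errors = 0
        # Step 2: visit each training sample and update on a misclassification.
        for x, target in zip(samples, targets):
            output = predict(weights, x)
            if output != target:
                step = eta * (target - output)
                weights[0] += step              # bias input is fixed at 1
                for j, xj in enumerate(x, start=1):
                    weights[j] += step * xj
                errors += 1
        # Step 3: repeat until a full pass produces no misclassification.
        if errors == 0:
            return weights, epoch
    return weights, max_epochs


# Hypothetical toy data: logical AND with targets in {-1, +1}.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, -1, -1, 1]
weights, epochs = train_perceptron(X, y)
predictions = [predict(weights, x) for x in X]
print(predictions, "after", epochs, "epochs")
```

Because logical AND is linearly separable, the loop stops on its own once a full pass over the data produces no update; the cap on epochs only matters for data that cannot be separated.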

<br><br>
Weights are modified at each step according to the perceptron training rule, which is defined as follows : 

<br>
$
    \quad
    \begin{align}
        w_j \leftarrow w_j + \Delta w_j
        \qquad \text{where} \quad \Delta w_j = \eta \ (\text{target}^{(i)} - \text{output}^{(i)}) \ x^{(i)}_{j}
        \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad 
        \qquad \qquad \qquad \qquad & [\textbf{E1}]   
    \end{align}
$

<br>
Here $ \eta $ is a positive constant (typically between 0.0 and 1.0) called the learning rate. Its role is to moderate the degree to which the weights are changed at each step, and it is sometimes made to decay as the number of weight-tuning iterations increases.

<br>
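It is important to note that <b>all weights are being updated simultaneously</b>: the output is computed once from the current weights, and only then is every weight changed. For a 2-dimensional dataset, we would write the update $ w \leftarrow w + \Delta w $ as follows, where $ w_0 $ is the bias weight, whose input is fixed at $ x_0 = 1 $: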

<br>
$
    \quad
    \begin{align}
        \Delta w_0 \quad &= \quad \eta \ (\text{target}^{(i)} - \text{output}^{(i)})               \newline
        \Delta w_1 \quad &= \quad \eta \ (\text{target}^{(i)} - \text{output}^{(i)}) \ x^{(i)}_{1} \newline
        \Delta w_2 \quad &= \quad \eta \ (\text{target}^{(i)} - \text{output}^{(i)}) \ x^{(i)}_{2}
    \end{align}
$
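
<br>
As a small illustration of the simultaneous update, the sketch below applies one step to a 2-dimensional sample. The function name `perceptron_update` and the numeric values are hypothetical, chosen only so that the arithmetic is easy to follow by hand.

```python
def perceptron_update(weights, x, target, output, eta):
    """Return the updated weights [w0, w1, w2] for one 2-D sample x = (x1, x2)."""
    error = target - output          # computed once, from the old weights
    inputs = (1.0,) + tuple(x)       # x0 = 1 pairs with the bias weight w0
    # Every weight receives its own delta, all derived from the same error term.
    return [w + eta * error * xj for w, xj in zip(weights, inputs)]


old = [0.0, 0.5, -0.5]
# With these weights the activation is 0.0 + 0.5*2.0 - 0.5*3.0 = -0.5,
# so the output is -1 while the target is +1.
new = perceptron_update(old, x=(2.0, 3.0), target=1, output=-1, eta=0.25)
print(new)  # [0.5, 1.5, 1.0]
```

Here $ \eta \ (\text{target} - \text{output}) = 0.25 \cdot 2 = 0.5 $, so the three deltas are $ 0.5 $, $ 0.5 \cdot 2 $ and $ 0.5 \cdot 3 $.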


### Convergence

<br>
Why should this rule converge toward successful weight values? To get an intuitive feel, let's consider some specific cases. Suppose the training example is already correctly classified by the perceptron. There are two such scenarios, and in both of them the weights remain unchanged:

<br>
$
    \quad
    \begin{align}
        \Delta w_j \quad &= \quad 
        \begin{cases}
            \eta \ [(-1) - (-1)] \ x^{(i)}_{j} &= 0 
                \qquad \qquad & \text{if} \ \text{target}^{(i)} = \text{output}^{(i)} = -1 \\
            \eta \ [(+1) - (+1)] \ x^{(i)}_{j} &= 0 
                \qquad \qquad & \text{if} \ \text{target}^{(i)} = \text{output}^{(i)} = +1
        \end{cases}
    \end{align}
$

<br>
Suppose now that the perceptron misclassifies the training example. In case of a wrong prediction, the weights are "pushed" in the direction of the positive or negative target class:

<br>
$
    \quad
    \begin{align}
        \Delta w_j \quad &= \quad 
        \begin{cases}
            \eta \ [(+1) - (-1)] \ x^{(i)}_{j} \ = \ \eta \ (2) \ x^{(i)}_{j} 
                \quad & \text{if} \ \text{target}^{(i)} = +1 \quad \text{and} \quad \text{output}^{(i)} = -1 \\
            \eta \ [(-1) - (+1)] \ x^{(i)}_{j} \ = \ \eta \ (-2) \ x^{(i)}_{j} 
                \quad & \text{if} \ \text{target}^{(i)} = -1 \quad \text{and} \quad \text{output}^{(i)} = +1 \\
        \end{cases}
    \end{align}
$

<br>
In the first case, the perceptron misclassifies a positive instance as negative. <b>To make the perceptron change its output, the weights must be altered to increase the value of the dot product $(w \cdot x)$; this moves the example toward the positive side of the decision hyperplane and brings the perceptron closer to a correct classification</b>. The second case is symmetric: a negative instance is misclassified as positive, so the update decreases $(w \cdot x)$.
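
<br>
The effect on the dot product can be checked numerically. The sketch below uses made-up weights and a made-up positive example that is initially classified as negative; the helper `activation` and all values are assumptions for illustration. With the bias input included, one update raises the activation by $ \eta \ (\text{target} - \text{output}) \ (1 + \sum_j x_j^2) $.

```python
def activation(weights, x):
    """Bias plus dot product: w0 + w . x."""
    return weights[0] + sum(w * xj for w, xj in zip(weights[1:], x))


eta = 0.5
weights = [-1.0, 0.5, -1.0]      # hypothetical current weights [w0, w1, w2]
x, target = (1.0, 2.0), 1        # a positive example

before = activation(weights, x)              # -1.0 + 0.5 - 2.0 = -2.5
output = 1 if before >= 0.0 else -1          # misclassified as -1
step = eta * (target - output)               # 0.5 * 2 = 1.0

updated = [weights[0] + step] + [w + step * xj for w, xj in zip(weights[1:], x)]
after = activation(updated, x)               # 0.0 + 1.5 + 2.0 = 3.5

print(before, "->", after)
```

In this particular example a single step is enough to move the activation across the threshold; in general several updates may be needed, since other training examples pull the weights in their own directions.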

<br><br>
<b>The perceptron training rule can be proven to converge</b> within a finite number of iterations to a weight vector that correctly classifies all training examples, <b>provided the following conditions</b> :

<ul style="list-style-type:square">
    <li>
        <b>the task is linear</b> in nature (or, in other words, the training examples are linearly separable). <b>When the data
        set is not linearly separable, convergence is not assured</b>; in this case, we can set a maximum number of iterations
        over the training dataset and/or a threshold for the number of tolerated misclassifications
    </li>
    <br>
    <li>
        <b>a sufficiently small value of the learning rate</b> is used
    </li>
</ul>
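
<br>
The first condition can be illustrated with XOR, a classic dataset that is not linearly separable. The sketch below is a self-contained, hypothetical training loop (the name `train` and the `max_epochs` cap of 50 are assumptions); it shows why a maximum number of passes is needed when separability is not guaranteed.

```python
def train(samples, targets, eta=0.1, max_epochs=50):
    """Perceptron rule with an epoch cap; returns (weights, epochs_run, last_errors)."""
    weights = [0.0] * (len(samples[0]) + 1)   # weights[0] is the bias
    errors = 0
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for x, target in zip(samples, targets):
            act = weights[0] + sum(w * xj for w, xj in zip(weights[1:], x))
            output = 1 if act >= 0.0 else -1
            if output != target:
                step = eta * (target - output)
                weights = [weights[0] + step] + [
                    w + step * xj for w, xj in zip(weights[1:], x)
                ]
                errors += 1
        if errors == 0:        # a clean pass: every example classified correctly
            break
    return weights, epoch, errors


# XOR with targets in {-1, +1}: no straight line separates the two classes.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, 1, 1, -1]
weights, epochs_run, last_errors = train(X, y)
print("stopped after", epochs_run, "epochs with", last_errors, "errors in the last pass")
```

Since no weight vector classifies all four XOR points correctly, every pass contains at least one update and the loop ends only because of the cap. A tolerance on the number of misclassifications, as mentioned above, could be added by replacing the `errors == 0` test with a comparison against a threshold.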


## References

<br>
<ul style="list-style-type:square">
    <li>
        Tom Mitchell - Machine Learning <br>
        http://www.cs.ubbcluj.ro/~gabis/ml/ml-books/McGrawHill%20-%20Machine%20Learning%20-Tom%20Mitchell.pdf
    </li>
    <br>
    <li>
        Sebastian Raschka - Single Layer Neural Networks and Gradient Descent <br>
        http://sebastianraschka.com/Articles/2015_singlelayer_neurons.html#the-perceptron-learning-rule
    </li>
</ul>
