# Introduction to Deep Learning with Python

## Chapter 1: Introduction

### 1.1 Rosenblatt's Perceptron

The **perceptron** was the first type of artificial neuron, introduced by [Frank Rosenblatt][1] in the late 1950s. Its design was inspired by the McCulloch-Pitts model of a neuron. While perceptrons have since been replaced by other types of neurons, their basic design lives on in modern neural networks.

A perceptron can be used to learn a *linearly separable* classification task. It takes **inputs** ${[x_1, x_2, ..., x_n]}$ and computes a binary output $y_i$. The **weights** ${[w_1, w_2, ..., w_n]}$ express the importance of the respective inputs to the output. The output is calculated as a weighted sum over the inputs:

$$y_i = \sum_{i}w_i x_i$$

[1]: http://www.ling.upenn.edu/courses/cogs501/Rosenblatt1958.pdf

![Perceptron](00_ressources/img/chapter_1/perceptron.png)

To ensure that $y_i$ is binary, the perceptron applies a *step function* (*hard limiter*) with a learned *threshold*, expressed here as a **bias** term $b$:

$$y_i = 
\begin{cases}
    0 &\text{if $w \cdot x+b < 0$}\cr  
    1 &\text{if $w \cdot x+b \geq 0$}
\end{cases}$$

where $w \cdot x \equiv \sum_{i}w_i x_i$ is the *dot product* of $w$ and $x$, and $b$ is the bias (the negative of the threshold). The step function is a nonlinear function that maps the weighted sum to the desired output. As we will see later, modern neural networks still require a nonlinear function, but its shape is smoother than the step function (a *soft limiter*).
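As a minimal sketch of the two formulas above, the snippet below computes the weighted sum and then applies the step function. The helper name `perceptron_output` and the values of `w` and `b` are made up for illustration and do not come from the text.

```python
import numpy as np

def perceptron_output(x, w, b):
    """Return 1 if w . x + b >= 0, otherwise 0 (hard limiter)."""
    z = np.dot(w, x) + b          # weighted sum plus bias
    return 1 if z >= 0 else 0     # step function

# Illustrative values only (assumptions, not taken from the text)
w = np.array([0.5, 0.5])
b = -0.7

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", perceptron_output(np.array(x), w, b))
```

With these particular values the unit outputs 1 only for the input `[1, 1]`, so it behaves like a logical AND.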



![Stepfunction](00_ressources/img/chapter_1/step_function.png)

Finally, the perceptron learns by iteratively updating the weight vector $w$ in the following way:

$$\dot{w} \leftarrow w + \nu \cdot (y_i - \hat{y}_i) \cdot x_i $$

where $\dot{w}$ is the new weight vector, $\nu$ is the *learning rate*, $(y_i - \hat{y}_i)$ is the error in the current iteration, and $x_i$ is the current input. Here the index $i$ refers to the training example rather than to a component of $x$.
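To make the update rule concrete, here is a small sketch of a single update step. The function name `perceptron_update`, the zero starting weights, and the example input are assumptions chosen for illustration. The bias is folded into the weight vector by appending a constant 1 to the input.

```python
import numpy as np

def perceptron_update(w, x, y, nu):
    """Apply one step of the perceptron learning rule and return the new weights."""
    y_hat = 1 if np.dot(w, x) >= 0 else 0   # step-function prediction
    return w + nu * (y - y_hat) * x         # weights change only when y != y_hat

# Illustrative values (assumptions)
w = np.array([0.0, 0.0, 0.0])   # starting weights; the last one plays the role of the bias
x = np.array([0.0, 0.0, 1.0])   # input with a constant 1 appended for the bias
y = 0                           # target label
nu = 0.2                        # learning rate

w_new = perceptron_update(w, x, y, nu)
print(w_new)
```

The prediction for this input is 1 while the target is 0, so the error is -1 and only the bias weight moves, from 0 to -0.2. Applying the rule again to the same example leaves the weights unchanged, because the prediction is then correct.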

In [2]:
# Coding Rosenblatt's Perceptron from scratch
# -------------------------------------------
import numpy as np
import random

random.seed(1)  # Seeds Python's RNG (row sampling); NumPy's RNG used for w below is not seeded

# Step function (hard limiter): 0 for negative input, 1 otherwise
def unit_step(x):
    return 0 if x < 0 else 1

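# Data: two binary inputs plus a constant 1, so the third weight acts as the bias
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
# Labels: logical OR of the first two inputs
y = np.array([0, 1, 1, 1])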

w = np.random.rand(3) # Weights
errors = []           # Errors
eta = 0.2             # Learning rate
n = 100               # Epochs

# Training
for i in range(n):
    # Get row index
    index = random.randint(0,3)
    # Define minibatch (online)
    x_batch = X[index,:]
    y_batch = y[index]
    # Calculate activation
    y_hat = unit_step(np.dot(w, x_batch))
    # Calculate error
    error = y_batch - y_hat
    errors.append(error)
    # Update weights
    w += eta * error * x_batch

# Prediction  
for index, x in enumerate(X):
    y_hat = np.dot(x, w)
    print("{}: {} -> {} | {}".format(index, round(y_hat, 3), unit_step(y_hat), y[index]))
    


0: -0.081 -> 0 | 0
1: 0.434 -> 1 | 1
2: 0.015 -> 1 | 1
3: 0.53 -> 1 | 1
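Because the labels above (a logical OR) are linearly separable, the perceptron rule stops making mistakes after finitely many updates. One rough way to observe this is to sweep over all rows repeatedly and count the mistakes per pass. The sketch below does so under assumptions that differ from the cell above: zero initial weights, a fixed row order, and a hypothetical helper named `train_until_converged`.

```python
import numpy as np

def train_until_converged(X, y, eta=0.2, max_epochs=100):
    """Sweep over all rows until one full pass produces no mistakes."""
    w = np.zeros(X.shape[1])
    mistakes_per_epoch = []
    for _ in range(max_epochs):
        mistakes = 0
        for x_row, target in zip(X, y):
            y_hat = 1 if np.dot(w, x_row) >= 0 else 0
            error = target - y_hat
            if error != 0:
                w = w + eta * error * x_row   # perceptron update
                mistakes += 1
        mistakes_per_epoch.append(mistakes)
        if mistakes == 0:                     # a clean pass: stop training
            break
    return w, mistakes_per_epoch

X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
y = np.array([0, 1, 1, 1])  # logical OR

w, mistakes_per_epoch = train_until_converged(X, y)
print("weights:", w)
print("mistakes per epoch:", mistakes_per_epoch)
```

A final entry of 0 in `mistakes_per_epoch` indicates that the learned weights classify all four rows correctly.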


### 1.2 Limitations of Perceptrons

Shortly after its first publication, the perceptron gained a lot of attention in the early 1960s and was generally considered a very powerful learning algorithm. This perspective shifted rapidly in the late 1960s and early 1970s after the famous critique by [Minsky and Papert (1969)][2].

Minsky and Papert proved that the perceptron is actually quite limited in what it can learn. More precisely, they showed that the perceptron requires the right features in order to learn a classification task correctly. Given enough hand-selected features, the perceptron still performs very well.

[2]: https://mitpress.mit.edu/books/perceptrons


The following anecdote ([source][3]) illustrates how a network can pick up on the wrong features:

> Once upon a time, the US Army wanted to use neural networks to automatically detect camouflaged enemy tanks. The researchers trained a neural net on 50 photos of camouflaged tanks in trees, and 50 photos of trees without tanks. Using standard techniques for supervised learning, the researchers trained the neural network to a weighting that correctly loaded the training set: output “yes” for the 50 photos of camouflaged tanks, and output “no” for the 50 photos of forest. This did not ensure, or even imply, that new examples would be classified correctly.

> The neural network might have “learned” 100 special cases that would not generalize to any new problem. Wisely, the researchers had originally taken 200 photos, 100 photos of tanks and 100 photos of trees. They had used only 50 of each for the training set. The researchers ran the neural network on the remaining 100 photos, and without further training the neural network classified all remaining photos correctly. Success confirmed! The researchers handed the finished work to the Pentagon, which soon handed it back, complaining that in their own tests the neural network did no better than chance at discriminating photos.

> It turned out that in the researchers’ dataset, photos of camouflaged tanks had been taken on cloudy days, while photos of plain forest had been taken on sunny days. The neural network had learned to distinguish cloudy days from sunny days, instead of distinguishing camouflaged tanks from empty forest.

[3]: http://intelligence.org/files/AIPosNegFactor.pdf

In [3]:
# Same data
X = np.array([[0,0,1], 
              [0,1,1], 
              [1,0,1], 
              [1,1,1]]
            )

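# Updated labels: 1 when both inputs agree (XNOR), which is not linearly separable
y = np.array([1,0,0,1])

# Again training (continues from the weights w learned above)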
for i in range(n):
    # Get row index
    index = random.randint(0,3)
    # Define minibatch (online)
    x_batch = X[index,:]
    y_batch = y[index]
    # Calculate activation
    y_hat = unit_step(np.dot(w, x_batch))
    # Calculate error
    error = y_batch - y_hat
    errors.append(error)
    # Update weights
    w += eta * error * x_batch

# ... and prediction  
for index, x in enumerate(X):
    y_hat = np.dot(x, w)
    print("{}: {} -> {} | {}".format(index, round(y_hat, 3), unit_step(y_hat), y[index]))
    



0: -0.081 -> 0 | 1
1: -0.166 -> 0 | 0
2: 0.015 -> 1 | 0
3: -0.07 -> 0 | 1


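The perceptron cannot learn this labelling. To see why, let $b$ here denote the threshold that the weighted sum $w_1 x_1 + w_2 x_2$ must reach for the output to be 1 (the negative of the bias in the step function above). The four observations then yield the following inequalities:

$$w_1 \cdot 0 + w_2 \cdot 0 = 0 \geq b$$
$$w_1 \cdot 0 + w_2 \cdot 1 = w_2 < b$$
$$w_1 \cdot 1 + w_2 \cdot 0 = w_1 < b$$
$$w_1 \cdot 1 + w_2 \cdot 1 = w_1 + w_2 \geq b$$

Adding the first and fourth inequalities, and the second and third, gives:

$$w_1 + w_2 \geq 2b$$
$$w_1 + w_2 < 2b$$

These two results contradict each other, so no choice of weights and threshold satisfies all four conditions.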
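In line with the earlier point about hand-selected features, the conflict between these inequalities disappears once a suitable feature is added. The sketch below appends the product $x_1 x_2$ as an extra input and trains with the same update rule. The product feature is an illustrative choice rather than something taken from the text, and the helper name `train_perceptron`, the zero initial weights, and the fixed row order are likewise assumptions.

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Train with the perceptron rule; return the weights and whether a clean pass occurred."""
    # eta=1.0 keeps the floating-point arithmetic exact for these 0/1 inputs
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_row, target in zip(X, y):
            y_hat = 1 if np.dot(w, x_row) >= 0 else 0
            if y_hat != target:
                w = w + eta * (target - y_hat) * x_row
                mistakes += 1
        if mistakes == 0:
            return w, True        # every row classified correctly
    return w, False               # no clean pass within max_epochs

X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
y = np.array([1, 0, 0, 1])

# Original features: no weight vector classifies all four rows
_, ok_raw = train_perceptron(X, y)

# Hand-crafted extra feature: the product of the first two inputs
product = (X[:, 0] * X[:, 1]).reshape(-1, 1)
X_aug = np.hstack([X, product])
w_aug, ok_aug = train_perceptron(X_aug, y)

print("original features, clean pass reached:", ok_raw)
print("with product feature, clean pass reached:", ok_aug)
print("weights with product feature:", w_aug)
```

With the original three columns, a pass without mistakes can never occur, whereas with the added column the same learning rule finds weights that reproduce all four labels.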

Unfortunately, Minsky and Papert's critique was misunderstood by a large portion of the scientific community: if learning the right features is the essential part and neural networks cannot learn those features by themselves, they are generally useless for non-trivial learning problems.

While this understanding is in general not true, it became common wisdom for the next 20 years and led to a dramatic decrease in scientific interest in neural networks for learning problems. It was not until the late 1980s and early 1990s that people started to realize that learning the features resolves this limitation.

Recognizing that learning good features is the real problem not only affected further neural network research but also gave rise to other learning algorithms. One of the most prominent is the *Support Vector Machine (SVM)*, where the learning problem is solved by transforming the input space into a nonlinear feature space.