# Table of Contents

[Neural Networks](#NeuralNetworks)

* [Introduction](#Introduction)
	* [History](#History)
	* [Perceptrons](#Datasets)





# <a id="NeuralNetworks"></a>Neural Networks

## <a id="Introduction"></a>Introduction

### <a id="BiologicalNeurons"></a>Biological Neurons

Many tasks that involve intelligence, pattern recognition, object classification, or detection are difficult
to implement using classical software engineering principles, even though animals and young children perform them easily.

For example, the family cat can easily tell you apart from a stranger.
A small child can easily recognize who is dad and who is mom.

Human brains perform complex pattern recognition tasks without us even noticing.
How can brains do that?

The answer lies in our bodies. Each of us contains a real-life biological neural
network that is part of our nervous system. This network is composed of a large number of
interconnected neurons (nerve cells).

The human brain has tens of billions of neurons (a commonly cited estimate is roughly 86 billion), each connected to
thousands of other neurons. The cell body of the neuron is called the ***soma***. Inputs arrive at the soma through
the dendrites, and the output leaves through the axon, which connects one soma to others.

Each neuron receives electrochemical inputs from other neurons at its ***dendrites***. If these
electrical inputs are powerful enough to activate the neuron, then the activated neuron transmits
the signal along its ***axon***, passing it along to the ***dendrites*** of other neurons. These attached neurons
may also fire, thus continuing the process of passing the message along.

Firing a neuron is a binary operation: the neuron either
fires or it doesn't fire. There are no different ***grades*** of firing. A neuron fires only
if the total signal received at the ***soma*** exceeds a given threshold.

![Image](course/assets/image/biological-neurons.png)

***Dendrite***: Receives signals from other neurons   
***Soma***: Processes the information   
***Axon***: Transmits the output of this neuron   
***Synapse***: Point of connection to other neurons   

Can we simulate a neural network from nature?

If we want to simulate the structure of the brain, we should implement a computation system composed of connected nodes,
where each node executes a simple computation. Such a structure can be implemented using a directed graph.
From graph theory, we know that a directed graph consists of a set of nodes (i.e., vertices) and a set of
connections (i.e., edges) that link pairs of nodes together.

Each node performs a simple computation. Each connection then carries a signal (i.e., the
output of the computation) from one node to another, labeled by a weight indicating the extent to
which the signal is amplified or diminished. Some connections have large, positive weights that
amplify the signal, indicating that the signal is very important when making a classification. 
Others have negative weights, diminishing the strength of the signal, thus specifying that the output of
the node is less important in the final classification. 
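As a rough illustration of the paragraph above, the sketch below computes what a single node would receive from three incoming connections. The signal values and weights are made-up numbers chosen only for illustration, and `node_output` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def node_output(signals, weights):
    """Return the weighted sum of the incoming signals for one node."""
    return float(np.dot(signals, weights))

# three incoming signals of equal strength (illustrative values)
signals = np.array([1.0, 1.0, 1.0])

# hypothetical weights: the first connection amplifies its signal,
# the second dampens it, and the third carries a negative weight
weights = np.array([2.0, 0.5, -1.0])

total = node_output(signals, weights)
print(total)  # 2.0 + 0.5 - 1.0 = 1.5
```

Even though all three signals have the same strength, the first connection contributes the most to the total because of its large positive weight.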

Initially, the connection weights are set to random values, which are then modified by a learning algorithm.

Such a system is called an Artificial Neural Network.

The word ***neural*** is the adjective form of ***neuron***, and ***network*** denotes a graph-like
structure. An ***Artificial Neural Network*** is therefore a computation system that attempts to simulate the neural
connections in our nervous system.

Artificial neural networks are also referred to simply as ***neural networks***. It is common to abbreviate
Artificial Neural Network as ***ANN*** or simply ***NN***.

### <a id="ArtificialNeurons"></a>Artificial Neurons

In 1943, ***Warren S. McCulloch***, a neuroscientist, and ***Walter Pitts***, a logician, published the paper ***A logical calculus of the ideas immanent in nervous activity***. In this paper, McCulloch and Pitts tried to understand how the brain could produce highly complex patterns by using many basic cells that are connected together. These basic brain cells are called neurons, and McCulloch and Pitts gave a highly simplified model of a neuron in their paper.

The McCulloch and Pitts model of a neuron, which we will call an ***MCP neuron*** for short, made an important contribution to the development of artificial neural networks, which model key features of biological neurons.

The model is divided into two parts. The first part, ***g***, takes the inputs and aggregates them. Based on the aggregated value, the second part, ***f***, makes a decision.

![Image](course/assets/image/McCullochPittsNeuron.png)
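The two-part structure can be sketched in a few lines of Python. This is a simplified sketch under stated assumptions: it treats all inputs as excitatory boolean values, uses a plain sum for ***g***, and fires when the sum reaches a threshold. The function name `mcp_neuron` and the `>=` convention are choices made here for illustration.

```python
def mcp_neuron(inputs, threshold):
    """Simplified MCP neuron over boolean (0/1) inputs."""
    # g: aggregate the inputs by summing them
    aggregated = sum(inputs)
    # f: decide whether to fire based on the aggregated value
    return 1 if aggregated >= threshold else 0

# With two inputs, a threshold of 2 behaves like AND
# and a threshold of 1 behaves like OR.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", mcp_neuron([a, b], 2), "OR:", mcp_neuron([a, b], 1))
```

Note that the behavior is fixed entirely by the threshold: there are no weights and nothing is learned, which is one of the limitations discussed next.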

The original ***MCP neuron*** had limitations, so the next major development in neural networks was the concept of a ***perceptron***, introduced by ***Frank Rosenblatt*** in 1958. It was further refined and carefully analyzed by ***Minsky*** and ***Papert*** (1969); their model is referred to as the ***perceptron*** model.

Essentially the ***perceptron*** is an ***MCP neuron*** where the inputs are first passed through some ***preprocessors*** which are called association units. These association units detect the presence of certain specific features in the inputs. In fact, as the name suggests, a perceptron was intended to be a pattern recognition device, and the association units correspond to feature or pattern detectors.

### Perceptrons

![Image](course/assets/image/perceptron-model.png)


The perceptron model, proposed by Minsky and Papert, is a more general computational model than the ***MCP neuron***. It overcomes some of the limitations of the ***MCP neuron*** by introducing numerical weights (a measure of importance) for the inputs, and a mechanism for learning those weights. Inputs are no longer limited to boolean values as in the ***MCP neuron***. Real-valued inputs are supported as well, which makes the model more useful and more general.

It takes an input, aggregates it (as a weighted sum), and returns 1 only if the aggregated sum is greater than some threshold. Otherwise it returns 0.

A Perceptron contains N input nodes, one for each entry in an input row, followed by a single layer containing just one node.

Each input has a corresponding weight w1, w2, ..., wN. The node takes the weighted sum of the inputs and applies a step
function (the activation function) to determine the output class label. The Perceptron outputs either a 0 or a 1:
* 0 for class 1   
* 1 for class 2   

Thus, in its original form, the Perceptron is simply a binary, two-class classifier.
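The forward computation described above can be sketched as follows. The weights and the threshold are hand-picked for illustration (they are not learned), and `perceptron_output` is a hypothetical helper name.

```python
import numpy as np

def perceptron_output(x, w, threshold):
    """Weighted sum of the inputs followed by a step function."""
    weighted_sum = np.dot(w, x)
    return 1 if weighted_sum > threshold else 0

# hand-picked weights and threshold (assumed values, not learned)
w = np.array([1.0, 1.0])
threshold = 1.5

# boolean inputs: only (1, 1) pushes the sum above the threshold
print(perceptron_output(np.array([1, 1]), w, threshold))      # 1
print(perceptron_output(np.array([0, 1]), w, threshold))      # 0

# real-valued inputs are handled the same way
print(perceptron_output(np.array([0.9, 0.8]), w, threshold))  # 1
print(perceptron_output(np.array([0.2, 0.4]), w, threshold))  # 0
```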

#### Perceptron Training Procedure and the Delta Rule

Training a Perceptron is a fairly straightforward operation. 

***The goal of the training procedure is to find a set of weights w that correctly classifies each instance in our training set.***


In order to train a Perceptron, we iteratively feed the network with our training data multiple times. Each time the network has seen the full set of training data, we say an epoch has passed. It normally takes many epochs until a
weight vector w is learned that linearly separates our two classes of data.

The pseudocode for the Perceptron training algorithm is listed at the end of this section.

The actual "learning" takes place in Steps 2b and 2c. First, we pass the feature vector xj through
the network: we take the dot product with the weights w and pass the result through the step function,
which returns 1 if its input is greater than 0 and 0 otherwise. This gives the output yj.

Now we need to update our weight vector w to step in the direction that is "closer" to the
correct classification.

This update of the weight vector is handled by the delta rule in Step 2c.

The expression (dj - yj) determines whether the output classification is correct. If the classification
is correct, this difference is zero. Otherwise, the difference is either positive or
negative, giving us the direction in which our weights will be updated (ultimately bringing us closer
to the correct classification). We then multiply (dj - yj) by xj, moving us closer to the correct
classification.

Pseudo code:

1. Initialize the weight vector w with small random values
2. Until the Perceptron converges:
 (a) Loop over each feature vector xj and true class label dj in our training set D   
 (b) Take xj and pass it through the network, calculating the output value: yj = f(w(t) · xj)   
 (c) Update the weights w: wi(t+1) = wi(t) + α(dj - yj)xj,i for all features 0 <= i <= n   
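To make Steps 2b and 2c concrete, here is one update traced by hand. This is a sketch under simplifying assumptions: the weights start at zero instead of small random values so the arithmetic is easy to follow, the last entry of `x` is a constant 1 for the bias, and `delta_rule_update` is a hypothetical helper name.

```python
import numpy as np

def step(value):
    # step activation: 1 if the input is greater than 0, otherwise 0
    return 1 if value > 0 else 0

def delta_rule_update(w, x, d, alpha):
    """Apply one delta-rule update for a single training sample."""
    y = step(np.dot(w, x))           # Step 2b: forward pass
    return w + alpha * (d - y) * x   # Step 2c: weight update

w = np.zeros(3)                  # assumed starting weights (zeros for readability)
x = np.array([1.0, 1.0, 1.0])    # two features plus a constant 1 for the bias
d = 1                            # true class label

# forward pass gives y = step(0) = 0, so (d - y) = 1 and w moves towards x
w_new = delta_rule_update(w, x, d, alpha=0.1)
print(w_new)  # [0.1 0.1 0.1]

# the sample is now classified correctly, so another update changes nothing
print(delta_rule_update(w_new, x, d, alpha=0.1))  # [0.1 0.1 0.1]
```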


The value α is our learning rate and controls how large (or small) a step we take. It's
critical that this value is set correctly. A larger value of α will cause us to take a step in the
right direction; however, this step could be too large, and we could easily overstep a local/global
optimum.
Conversely, a small value of α allows us to take tiny baby steps in the right direction, ensuring
we don't overstep a local/global minimum; however, these tiny baby steps may take an intractable
amount of time for our learning to converge.
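The effect of the step size can be seen on a single misclassified sample. The sketch below counts how many delta-rule updates are needed before the sample is classified correctly, for a few learning rates. The starting weights and the sample are made-up values, and `updates_until_correct` is a hypothetical helper written only for this illustration.

```python
import numpy as np

def step(value):
    return 1 if value > 0 else 0

def updates_until_correct(w, x, d, alpha, max_updates=1000):
    """Count delta-rule updates until the sample x is classified as d."""
    w = w.copy()
    for count in range(max_updates + 1):
        y = step(np.dot(w, x))
        if y == d:
            return count
        w += alpha * (d - y) * x
    return None  # not classified correctly within max_updates

w_start = np.array([-1.0, -1.0, -0.5])  # assumed starting weights
x = np.array([1.0, 1.0, 1.0])           # two features plus bias input
d = 1                                   # true class label

for alpha in (1.0, 0.1, 0.01):
    print(alpha, updates_until_correct(w_start, x, d, alpha))
```

With these values a learning rate of 1.0 fixes the sample in a single update, 0.1 needs nine updates, and 0.01 needs many more. A larger step is not automatically better, though: with several samples, a large update made for one sample can undo progress made on the others.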

Now that we understand the algorithm behind Perceptrons, let's implement a Perceptron class in Python:

```python
import numpy as np

class Perceptron:

    def __init__(self, number_of_inputs, learning_rate=0.1):
        # fix the random seed so that results are reproducible
        np.random.seed(7)
        # initialize the weight vector, with one extra entry for the bias
        self.W = np.random.randn(number_of_inputs + 1) / np.sqrt(number_of_inputs)
        self.learning_rate = learning_rate

    # activation function
    def step(self, x):
        return 1 if x > 0 else 0

    def fit(self, X, y, epochs=10):

        # insert a column of 1's as the last entry in the feature matrix (bias)
        X = np.c_[X, np.ones(X.shape[0])]

        # start training
        for epoch in range(epochs):
            # loop over each input data point
            for (x, target) in zip(X, y):

                # calculate dot product of the input features and the weight vector,
                # then pass calculated value through the step function
                prediction = self.step(np.dot(x, self.W))

                # update weights, if prediction is not same as expected target value
                if prediction != target:
                    # calculate error
                    error = prediction - target

                    # update the weight vector
                    self.W += -self.learning_rate * error * x

    def predict(self, X):
        # note: X must hold a single data point (shape (1, n)),
        # because the step function works on one value at a time

        # insert a column of 1's as the last entry in the feature matrix (bias)
        X = np.c_[X, np.ones((X.shape[0]))]

        # take the dot product of the input features
        # and the weight vector, then pass calculated value
        # through the step function
        return self.step(np.dot(X, self.W))
```

#### Evaluating the Perceptron Algorithm

Let's test the Perceptron against the AND function:

```python
# define AND dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [0], [0], [1]])

# obtain perceptron
p = Perceptron(X.shape[1], learning_rate=0.1)

# train
p.fit(X, y, epochs=20)

# now that our network is trained, loop over the data points
for (x, target) in zip(X, y):

    # make a prediction, note: x must be a matrix
    pred = p.predict(np.atleast_2d(x))
    print("[INFO] input-data={}, target-value={}, predicted-value={}".format(x, target[0], pred))
```

```
[INFO] input-data=[0 0], target-value=0, predicted-value=0
[INFO] input-data=[0 1], target-value=0, predicted-value=0
[INFO] input-data=[1 0], target-value=0, predicted-value=0
[INFO] input-data=[1 1], target-value=1, predicted-value=1
```


We can see that the Perceptron is successfully trained to predict the AND function. Now let's try the OR function:

```python
# define OR dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [1]])

# obtain perceptron
p = Perceptron(X.shape[1], learning_rate=0.1)

# train
p.fit(X, y, epochs=20)

# now that our network is trained, loop over the data points
for (x, target) in zip(X, y):
    # make a prediction on the data point and display the result
    # to our console
    pred = p.predict(np.atleast_2d(x))
    print("[INFO] input-data={}, target-value={}, predicted-value={}".format(x, target[0], pred))
```

```
[INFO] input-data=[0 0], target-value=0, predicted-value=0
[INFO] input-data=[0 1], target-value=1, predicted-value=1
[INFO] input-data=[1 0], target-value=1, predicted-value=1
[INFO] input-data=[1 1], target-value=1, predicted-value=1
```


The Perceptron is successfully trained to predict the OR function as well. Now let's try the XOR function:

```python
# define XOR dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# obtain perceptron
p = Perceptron(X.shape[1], learning_rate=0.1)

# train
p.fit(X, y, epochs=20)

# now that our network is trained, loop over the data points
for (x, target) in zip(X, y):
    # make a prediction on the data point and display the result
    # to our console
    pred = p.predict(np.atleast_2d(x))
    print("[INFO] input-data={}, target-value={}, predicted-value={}".format(x, target[0], pred))
```

```
[INFO] input-data=[0 0], target-value=0, predicted-value=1
[INFO] input-data=[0 1], target-value=1, predicted-value=0
[INFO] input-data=[1 0], target-value=1, predicted-value=0
[INFO] input-data=[1 1], target-value=0, predicted-value=0
```


Here we can see that the Perceptron cannot be trained to predict the XOR function.
We can play with different learning rates, different
weight initialization schemes, or a different number of epochs, but a single Perceptron will never be able to correctly model the
XOR function.

The reason is that XOR is not linearly separable: no single straight line separates the inputs that map to 1 from those that map to 0. To model it, we need a neural network with more layers and
with nonlinear activation functions.
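To see why an extra layer helps, note that XOR can be written as AND(OR(a, b), NAND(a, b)), and each of those three functions is linearly separable. The sketch below wires three step units together with hand-picked weights and biases. These values are assumptions chosen for illustration, not learned, and the helper names are hypothetical.

```python
import numpy as np

def step_unit(x, w, b):
    """Single unit: weighted sum plus bias, followed by a step function."""
    return 1 if np.dot(w, x) + b > 0 else 0

def xor_two_layer(a, b):
    x = np.array([a, b])
    # hidden layer: two units with hand-picked weights
    h_or = step_unit(x, np.array([1.0, 1.0]), -0.5)     # behaves like OR
    h_nand = step_unit(x, np.array([-1.0, -1.0]), 1.5)  # behaves like NAND
    # output layer: AND of the two hidden units
    return step_unit(np.array([h_or, h_nand]), np.array([1.0, 1.0]), -1.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_two_layer(a, b))
```

The hidden units re-map the four inputs into a space where a single line can separate the classes. Hand-picking weights does not scale, which is why the network needs a way to learn them.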

The following class implements a small multi-layer network with sigmoid activations, trained with backpropagation:

```python
import numpy as np

class NeuralNetwork:
    def __init__(self, layers, learning_rate=0.1):

        self.W = []
        self.layers = layers
        self.learning_rate = learning_rate

        # initialize weights
        # stop before we reach the last two layers
        for i in np.arange(0, len(layers) - 2):
            # randomly initialize a weight matrix connecting the
            # number of nodes in each respective layer together,
            # adding an extra node for the bias
            w = np.random.randn(layers[i] + 1, layers[i + 1] + 1)
            self.W.append(w / np.sqrt(layers[i]))

        # the last two layers are a special case where the input
        # connections need a bias term but the output does not
        w = np.random.randn(layers[-2] + 1, layers[-1])
        self.W.append(w / np.sqrt(layers[-2]))

    # sigmoid activation function
    def sigmoid(self, x):
        return 1.0 / (1 + np.exp(-x))

    # derivative of the sigmoid function; note that `x` is assumed
    # to have already been passed through `sigmoid`
    def sigmoid_deriv(self, x):
        return x * (1 - x)

    def fit(self, X, y, epochs=1000, displayUpdate=100):

        # insert a column of 1's as the last entry in the feature matrix (bias)
        X = np.c_[X, np.ones((X.shape[0]))]

        # start training
        for epoch in np.arange(0, epochs):
            # loop over each individual data point
            for (x, target) in zip(X, y):
                self.fit_partial(x, target)

            # check to see if we should display a training update
            if epoch == 0 or (epoch + 1) % displayUpdate == 0:
                loss = self.calculate_loss(X, y)
                print("[INFO] epoch={}, loss={:.7f}".format(
                    epoch + 1, loss))

    def fit_partial(self, x, y):
        # construct our list of output activations for each layer
        # as our data point flows through the network; the first
        # activation is a special case -- it's just the input
        # feature vector itself
        A = [np.atleast_2d(x)]

        # FEEDFORWARD:
        # loop over the layers in the network
        for layer in np.arange(0, len(self.W)):
            # feedforward the activation at the current layer by
            # taking the dot product between the activation and
            # the weight matrix -- this is called the "net input"
            # to the current layer
            net = A[layer].dot(self.W[layer])

            # computing the "net output" is simply applying our
            # non-linear activation function to the net input
            out = self.sigmoid(net)

            # once we have the net output, add it to our list of
            # activations
            A.append(out)

        # BACKPROPAGATION
        # the first phase of backpropagation is to compute the
        # difference between our *prediction* (the final output
        # activation in the activations list) and the true target
        # value
        error = A[-1] - y

        # from here, we need to apply the chain rule and build our
        # list of deltas `D`; the first entry in the deltas is
        # simply the error of the output layer times the derivative
        # of our activation function for the output value
        D = [error * self.sigmoid_deriv(A[-1])]

        # once you understand the chain rule it becomes super easy
        # to implement with a `for` loop -- simply loop over the
        # layers in reverse order (ignoring the last two since we
        # already have taken them into account)
        for layer in np.arange(len(A) - 2, 0, -1):
            # the delta for the current layer is equal to the delta
            # of the *previous layer* dotted with the weight matrix
            # of the current layer, followed by multiplying the delta
            # by the derivative of the non-linear activation function
            # for the activations of the current layer
            delta = D[-1].dot(self.W[layer].T)
            delta = delta * self.sigmoid_deriv(A[layer])
            D.append(delta)

        # since we looped over our layers in reverse order we need to
        # reverse the deltas
        D = D[::-1]

        # WEIGHT UPDATE PHASE
        # loop over the layers
        for layer in np.arange(0, len(self.W)):
            # update our weights by taking the dot product of the layer
            # activations with their respective deltas, then multiplying
            # this value by some small learning rate and adding to our
            # weight matrix -- this is where the actual "learning" takes
            # place
            self.W[layer] += -self.learning_rate * A[layer].T.dot(D[layer])

    def predict(self, X, addBias=True):
        # initialize the output prediction as the input features -- this
        # value will be (forward) propagated through the network to
        # obtain the final prediction
        p = np.atleast_2d(X)

        # check to see if the bias column should be added
        if addBias:
            # insert a column of 1's as the last entry in the feature
            # matrix (bias)
            p = np.c_[p, np.ones((p.shape[0]))]

        # loop over our layers in the network
        for layer in np.arange(0, len(self.W)):
            # computing the output prediction is as simple as taking
            # the dot product between the current activation value `p`
            # and the weight matrix associated with the current layer,
            # then passing this value through a non-linear activation
            # function
            p = self.sigmoid(np.dot(p, self.W[layer]))

        # return the predicted value
        return p

    def calculate_loss(self, X, targets):
        # make predictions for the input data points then compute
        # the loss
        targets = np.atleast_2d(targets)
        predictions = self.predict(X, addBias=False)
        loss = 0.5 * np.sum((predictions - targets) ** 2)

        # return the loss
        return loss
```
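One detail in the class above is easy to miss: `sigmoid_deriv` computes `x * (1 - x)`, which is the derivative of the sigmoid only when `x` is already a sigmoid output (an activation), not the raw net input. The standalone sketch below checks this numerically against a central finite difference. The functions are re-declared outside the class purely for this check, and the test point 0.3 is an arbitrary choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

def sigmoid_deriv(a):
    # expects an activation a = sigmoid(z), not the net input z
    return a * (1 - a)

z = 0.3            # arbitrary net input
a = sigmoid(z)     # corresponding activation

# central finite difference approximation of d(sigmoid)/dz at z
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

print(abs(sigmoid_deriv(a) - numeric) < 1e-6)  # True: correct usage
print(abs(sigmoid_deriv(z) - numeric) < 1e-6)  # False: net input passed by mistake
```

This is why `fit_partial` passes entries of the activations list `A` to `sigmoid_deriv` rather than the net inputs.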

Let's train this network on the XOR function:

```python
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# 2 inputs, one hidden layer with 2 nodes, 1 output
nn = NeuralNetwork([2, 2, 1], learning_rate=0.5)
nn.fit(X, y, epochs=40000)

for i in range(0, 4):
    x = X[i]
    pred = nn.predict(x)
    # threshold the sigmoid output at 0.5 to obtain a class label
    step = 1 if pred > 0.5 else 0
    print("[INFO] data={}, expected={}, pred={}, step={}".format(x, y[i], pred, step))
```

```
[INFO] epoch=1, loss=0.4993851
[INFO] epoch=100, loss=0.4932219
[INFO] epoch=200, loss=0.4768772
[INFO] epoch=300, loss=0.4373929
[INFO] epoch=400, loss=0.3808364
[INFO] epoch=500, loss=0.3247498
[INFO] epoch=600, loss=0.2598484
[INFO] epoch=700, loss=0.1452399
[INFO] epoch=800, loss=0.0744832
[INFO] epoch=900, loss=0.0444825
[INFO] epoch=1000, loss=0.0301734
[INFO] epoch=1100, loss=0.0222834
[INFO] epoch=1200, loss=0.0174274
[INFO] epoch=1300, loss=0.0141909
[INFO] epoch=1400, loss=0.0119028
[INFO] epoch=1500, loss=0.0102110
[INFO] epoch=1600, loss=0.0089155
[INFO] epoch=1700, loss=0.0078952
[INFO] epoch=1800, loss=0.0070731
[INFO] epoch=1900, loss=0.0063979
[INFO] epoch=2000, loss=0.0058343
[INFO] epoch=2100, loss=0.0053575
[INFO] epoch=2200, loss=0.0049493
[INFO] epoch=2300, loss=0.0045962
[INFO] epoch=2400, loss=0.0042880
[INFO] epoch=2500, loss=0.0040168
[INFO] epoch=2600, loss=0.0037764
[INFO] epoch=2700, loss=0.0035621
[INFO] epoch=2800, loss=0.0033698
[INFO] epoch=2900, loss=0.

[INFO] epoch=24400, loss=0.0002390
[INFO] epoch=24500, loss=0.0002379
[INFO] epoch=24600, loss=0.0002369
[INFO] epoch=24700, loss=0.0002358
[INFO] epoch=24800, loss=0.0002348
[INFO] epoch=24900, loss=0.0002338
[INFO] epoch=25000, loss=0.0002327
[INFO] epoch=25100, loss=0.0002317
[INFO] epoch=25200, loss=0.0002307
[INFO] epoch=25300, loss=0.0002297
[INFO] epoch=25400, loss=0.0002288
[INFO] epoch=25500, loss=0.0002278
[INFO] epoch=25600, loss=0.0002268
[INFO] epoch=25700, loss=0.0002259
[INFO] epoch=25800, loss=0.0002249
[INFO] epoch=25900, loss=0.0002240
[INFO] epoch=26000, loss=0.0002230
[INFO] epoch=26100, loss=0.0002221
[INFO] epoch=26200, loss=0.0002212
[INFO] epoch=26300, loss=0.0002203
[INFO] epoch=26400, loss=0.0002194
[INFO] epoch=26500, loss=0.0002185
[INFO] epoch=26600, loss=0.0002176
[INFO] epoch=26700, loss=0.0002167
[INFO] epoch=26800, loss=0.0002158
[INFO] epoch=26900, loss=0.0002149
[INFO] epoch=27000, loss=0.0002141
[INFO] epoch=27100, loss=0.0002132
[INFO] epoch=27200, 
```