#Introduction to Neural Networks

##What is a Neural Network?

Neural networks have received a lot of attention lately because of their success in computer vision (e.g. Google's Deep Dream images), speech recognition and synthesis, and pattern classification. This raises the question: what exactly is a neural network?

Artificial neural networks, or ANNs, are a family of computational models inspired by the brain and used to approximate highly complex functions. A single neural network is composed of many highly interconnected processing elements (nodes, also called neurons), each of which performs some basic computation on its inputs. Each neuron typically receives many inputs from other neurons in the network and outputs a nonlinear function of the weighted sum of its inputs. The output from this neuron is then passed through the network as input to the next set of neurons. In this way, the neural network "deconstructs" and parallelizes its inputs, and then "reconstructs" a target state or classification.

There are many types of neural networks, each with its own connectivity and dynamics. However, all neural networks share the concept of training by modifying their connection weights. It is by manipulating these weights that the neural network learns to reconstruct the desired targets in the case of supervised learning, or an efficient compression of its inputs in the case of unsupervised learning.

In this tutorial, we'll take a look at one of the simplest neural networks: the multilayer perceptron (MLP), also known as the feedforward neural network.

##The Multilayer Perceptron

<img src="MultiLayerPerceptron.png">

Shown above is a simple multilayer perceptron. In this diagram, each of the circles represents a node, or neuron, in the network. Each neuron receives input from the previous layer of neurons (except the first layer, which receives the raw input), computes a nonlinear function of the weighted sum of its inputs, and sends its output to the next layer of neurons. Notice that there are no arrows between neurons within a layer, or going backwards between layers. For this reason, MLPs are also known as "feedforward" networks, to distinguish them from "recurrent" networks, which have much more complex dynamics.

The network above has three layers; the first layer is called the input layer, the second layer is known as the hidden layer, and the last layer is called the output layer. MLPs can have any number of hidden layers, but always have one input and one output layer. Each layer can be composed of any number of neurons, provided there are no connections within a layer.

When the network is first initialized, random weights are assigned to the connections between the individual neurons. When the network is given an input, it uses those random connection weights to pass the activity forward one layer at a time, eventually reaching the output layer. 

When the activity reaches the output layer, it is compared with a set of target values (this is what makes the MLP a supervised learning algorithm). The resulting error values are then used to change the weights of the connections such that the error is reduced by a small amount. Over many repetitions, the error shrinks until the network begins to correctly classify its input.

##Activation Functions

That's all well and good, but how do we *actually* perform the feedforward pass? What nonlinear function should the neurons compute? A classic and widely used choice is the logistic function, for several reasons:

* It is simple to compute
* Its derivative is equally simple to compute (important for reasons given below)
* It only outputs values between 0 and 1

The **logistic function** is given by the following formula:

$$f(x) = \frac{1}{ 1 + e^{-x} }$$

If we have a neuron $j$ with inputs $x_i$ and corresponding connection weights $w_{ij}$, the total input into neuron $j$ is given by:

$$I_j = \sum_{i=1}^{N} x_iw_{ij}$$

where $N$ is the total number of inputs into neuron $j$. Hence, the output of neuron $j$ is:

$$y_j = \frac{1}{ 1 + e^{-I_j} }$$

This process of summing weighted inputs and squashing them with the logistic (sigmoid) function is repeated for each neuron in the network, layer by layer, until the values of the output neurons are finally computed.
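
As a rough illustration of the formulas above, here is a minimal sketch of the computation for a single neuron $j$. The input and weight values are made up for the example, and the names `logistic`, `x`, and `w_j` are our own choices.

```python
import numpy as np

def logistic(x):
    """Logistic (sigmoid) function: f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical example values: three inputs feeding one neuron j.
x = np.array([0.5, -1.0, 2.0])     # inputs x_i
w_j = np.array([0.1, 0.4, -0.3])   # connection weights w_ij into neuron j

I_j = np.dot(x, w_j)   # total input: sum over i of x_i * w_ij
y_j = logistic(I_j)    # output of neuron j, always between 0 and 1

print("Total input:", I_j)
print("Output:", y_j)
```

For a whole layer, the same idea can be written as a single matrix product between the input vector and a weight matrix, followed by an elementwise call to `logistic`.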

##Backpropagation of Error

Now that we know how to pass the activity forward through the network, we can obtain the error for each of the output neurons by comparing their activity with their target values. The question now becomes, how do we use the error information to change the weights of the network?

This is done via a process called the "backpropagation of error". Let's first define the error measure we're trying to minimize. Here, we're going to use the sum of squared errors, with a factor of one half to make the derivative nice:

$$E = \frac{1}{2}\sum_{j=1}^{N_j} (t_j - y_j)^2$$

where $y_j$ is the actual output and $t_j$ is the target output for neuron $j$. We can take the derivative of this error to find out how the error will change as we change the activity of the output neuron.

$$\frac{\partial E}{\partial y_j} = -(t_j - y_j)$$

This gives us an idea of *how fast* the error is changing as we change $y_j$, and in *which direction* it changes. We can use this information to adjust the input weights of a neuron such that it **decreases the error**. To do that, we need error derivatives for *all* the neurons in the network, not just the output neurons.

Luckily, we can use the error derivatives with respect to the activities of the output neurons $y_j$ to compute the error derivatives with respect to the previous, hidden layer activities $y_i$. In order to do this, we must first look at how the error changes as we change the total input into neuron $j$. This is given by the chain rule.

$$\frac{\partial E}{\partial I_j} = \frac{\partial y_j}{\partial I_j} \frac{\partial E}{\partial y_j}$$

Since $y_j$ is just the sigmoid function of $I_j$, $\partial y_j/\partial I_j$ is simply the derivative of the sigmoid function, which is given by:

$$\frac{\partial y_j}{\partial I_j} = y_j(1-y_j)$$

Now we can combine this with our previously derived expression for $\partial E/\partial y_j$ to obtain:

$$\frac{\partial E}{\partial I_j} = - y_j (1-y_j) (t_j - y_j)$$
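
To make these expressions concrete, the sketch below evaluates them for two output neurons. The activities and targets are invented for illustration; only the formulas come from the derivation above.

```python
import numpy as np

# Hypothetical output activities y_j and targets t_j for two output neurons.
y = np.array([0.8, 0.3])
t = np.array([1.0, 0.0])

E = 0.5 * np.sum((t - y) ** 2)   # sum of squared errors with the factor 1/2
dE_dy = -(t - y)                 # dE/dy_j
dy_dI = y * (1.0 - y)            # derivative of the logistic function, y_j(1 - y_j)
dE_dI = dy_dI * dE_dy            # chain rule: dE/dI_j = -y_j(1 - y_j)(t_j - y_j)

print("E     =", E)
print("dE/dy =", dE_dy)
print("dE/dI =", dE_dI)
```

Note that the sign of each entry of `dE_dI` tells us whether increasing that neuron's total input would raise or lower the error.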

Since each neuron $i$ in the hidden layer below affects **all** of the output neurons, its *total* effect on the error is given by summing its effects on each of the output neurons. This can be expressed as follows:

$$\frac{\partial E}{\partial y_i} = \sum_{j=1}^{N_j} \frac{\partial I_j}{\partial y_i} \frac{\partial E}{\partial I_j}$$

The first derivative in the sum is the change in an output neuron's total input $I_j$ as you change the activity $y_i$ of a neuron $i$ in the previous layer. But this is simply the connection weight between neurons $i$ and $j$! So we can express the above error derivative as:

$$\frac{\partial E}{\partial y_i} = \sum_{j=1}^{N_j}w_{ij} \frac{\partial E}{\partial I_j}$$
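
In matrix form, this sum over output neurons is just a matrix-vector product. The following sketch assumes a hypothetical layer of three hidden neurons feeding two output neurons, with made-up weights stored as `W[i, j]` $= w_{ij}$.

```python
import numpy as np

# Hypothetical weights: 3 hidden neurons (index i) by 2 output neurons (index j).
W = np.array([[ 0.2, -0.5],
              [ 0.7,  0.1],
              [-0.3,  0.4]])

# Hypothetical error derivatives dE/dI_j for the two output neurons.
dE_dI_out = np.array([-0.032, 0.063])

# dE/dy_i = sum over j of w_ij * dE/dI_j, computed for every hidden neuron i at once.
dE_dy_hidden = W @ dE_dI_out

print(dE_dy_hidden)
```

Each entry of `dE_dy_hidden` collects the contributions of one hidden neuron to the error through all of its outgoing connections.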

We can repeat this process to get the error derivatives for every neuron in each layer of the network. So now that we have the error derivatives for all the neurons, *how do we apply the weight update?*

We start by defining the change in error with respect to the connection weight $w_{ij}$. This will tell us how much a particular weight is contributing to the error, and the direction we need to change it to reduce that error. Again using our trusty chain rule:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial I_j}{\partial w_{ij}} \frac{\partial E}{\partial I_j}$$

The first term on the right is just the activity $y_i$ of neuron $i$ in the layer below, because $I_j$ is the sum of the activities $y_i$ weighted by $w_{ij}$. The second term we computed above. Thus we finally obtain:

$$\frac{\partial E}{\partial w_{ij}} = y_i \frac{\partial E}{\partial I_j}$$

To change the weight $w_{ij}$, we multiply the derivative by $-\epsilon$, where $\epsilon$ is a small positive number (often called the learning rate), and add the result to the current value of the weight. This will change the weight a small amount in a direction that decreases the error. The negative sign is there because we want to *decrease* the error, so we step against the derivative.

$$\Delta w_{ij} = -\epsilon \frac{\partial E}{\partial w_{ij}}$$
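
Here is a small sketch of one such update for a single layer of weights. All of the numbers, including the value of `epsilon`, are hypothetical; the point is only to check that a small step against the derivative lowers the error.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def error(W, y_hidden, t):
    """Sum of squared errors (with factor 1/2) for weights W."""
    y_out = logistic(y_hidden @ W)
    return 0.5 * np.sum((t - y_out) ** 2)

# Hypothetical values: 3 hidden activities, 2 output targets, 3x2 weights.
y_hidden = np.array([0.6, 0.1, 0.9])
t = np.array([1.0, 0.0])
W = np.array([[ 0.2, -0.5],
              [ 0.7,  0.1],
              [-0.3,  0.4]])
epsilon = 0.1  # small positive step size (an assumed value)

y_out = logistic(y_hidden @ W)
dE_dI = -y_out * (1.0 - y_out) * (t - y_out)   # dE/dI_j
dE_dW = np.outer(y_hidden, dE_dI)              # dE/dw_ij = y_i * dE/dI_j

W_new = W - epsilon * dE_dW                    # w_ij + delta_w_ij

E_before = error(W, y_hidden, t)
E_after = error(W_new, y_hidden, t)
print("Error before:", E_before)
print("Error after: ", E_after)
```

If `epsilon` is chosen too large, the step can overshoot and the error may go up instead, which is why a small value is used.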

##Building a Neural Network

Now that we have an idea of how a neural network is able to change its weights to minimize an error function, let's build one! We're going to use Python's class system for this, which I will give a gentle introduction to. If you're already familiar with classes, the way I present things might seem strange, so bear with me.

First, we want to import the numpy library.
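
A minimal version of this step, using the conventional `np` alias (the alias is a convention, not a requirement):

```python
# Import numpy under its conventional short name.
import numpy as np

# Quick sanity check that the import worked: build a small array.
check = np.array([1.0, 2.0, 3.0])
print(check.sum())
```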

Next, we need to define our sigmoid function and our derivative function.
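
One possible way to write these two functions is sketched below. The names `sigmoid` and `sigmoid_derivative` are our own, and we assume the derivative is written in terms of the neuron's *output* $y$, matching the expression $y_j(1-y_j)$ derived earlier.

```python
import numpy as np

def sigmoid(x):
    """Logistic function, applied elementwise to scalars or arrays."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(y):
    """Derivative of the sigmoid, expressed via its output y = sigmoid(x)."""
    return y * (1.0 - y)

# Example: the sigmoid is steepest at x = 0, where its output is 0.5.
y = sigmoid(0.0)
print(y, sigmoid_derivative(y))
```

Writing the derivative in terms of the output is convenient because the outputs are already available after the forward pass.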

We have to define our training input and output data. The number of columns in X is the number of input nodes, while the number of columns in Y is the number of output nodes. The number of hidden nodes can be chosen arbitrarily. 
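
As an illustration, the sketch below sets up a small XOR-style data set. This particular data is an assumption on our part (as are the third, constant input column and the choice of four hidden nodes); any arrays with matching numbers of rows would work the same way.

```python
import numpy as np

# Assumed training inputs: 4 examples, 3 input nodes.
# The third column is always 1 and acts like a bias input.
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])

# Assumed training targets: 4 examples, 1 output node (XOR of the first two inputs).
Y = np.array([[0],
              [1],
              [1],
              [0]])

n_input = X.shape[1]    # number of input nodes
n_output = Y.shape[1]   # number of output nodes
n_hidden = 4            # arbitrary choice

print(n_input, n_hidden, n_output)
```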

For reproducibility, let's set the random seed.
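
A short sketch of what seeding does, using NumPy's legacy global seeding function. The seed value `1` is an arbitrary choice for the example.

```python
import numpy as np

np.random.seed(1)               # seed value is an assumption; any integer works
first = np.random.random(3)     # three random numbers in [0, 1)

np.random.seed(1)               # re-seeding restarts the same random sequence
second = np.random.random(3)

print(np.array_equal(first, second))
```

Because the weights are initialized randomly, fixing the seed means every run starts from the same weights and follows the same training trajectory.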

We must now define our synapse matrices. These are the weights. We fill them with random values between -1 and 1, remembering that they will be updated at each step during the training phase.
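
A minimal sketch of this initialization follows. The names `syn0` and `syn1`, the layer sizes, and the seed are assumptions for the example. Since `np.random.random` draws from $[0, 1)$, scaling by 2 and subtracting 1 gives values in $[-1, 1)$.

```python
import numpy as np

np.random.seed(1)  # assumed seed, for reproducibility

# Assumed layer sizes: 3 input nodes, 4 hidden nodes, 1 output node.
n_input, n_hidden, n_output = 3, 4, 1

# syn0 connects the input layer to the hidden layer,
# syn1 connects the hidden layer to the output layer.
syn0 = 2 * np.random.random((n_input, n_hidden)) - 1
syn1 = 2 * np.random.random((n_hidden, n_output)) - 1

print(syn0.shape, syn1.shape)
```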

We will now train the network.
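
Putting the pieces together, a training loop along these lines might look like the sketch below. It is one possible implementation of the backpropagation equations from earlier, not necessarily the exact code that produced the numbers in this tutorial: the data, seed, layer sizes, learning rate `epsilon`, number of steps, and the choice to report the mean absolute error are all assumptions, so the printed values may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(y):
    # Derivative written in terms of the output y = sigmoid(x).
    return y * (1.0 - y)

# Assumed XOR-style training data (third input column acts as a bias).
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
Y = np.array([[0], [1], [1], [0]])

np.random.seed(1)                           # assumed seed
syn0 = 2 * np.random.random((3, 4)) - 1     # input -> hidden weights
syn1 = 2 * np.random.random((4, 1)) - 1     # hidden -> output weights

epsilon = 1.0     # learning rate (assumed)
n_steps = 60000   # number of training steps (assumed)

for step in range(n_steps):
    # Forward pass, one layer at a time.
    hidden = sigmoid(X @ syn0)
    output = sigmoid(hidden @ syn1)

    # dE/dI for the output layer: -y(1 - y)(t - y).
    dE_dI_out = -(Y - output) * sigmoid_derivative(output)

    # Backpropagate to the hidden layer: dE/dy_i = sum_j w_ij * dE/dI_j.
    dE_dy_hidden = dE_dI_out @ syn1.T
    dE_dI_hidden = dE_dy_hidden * sigmoid_derivative(hidden)

    # Weight updates: delta_w_ij = -epsilon * y_i * dE/dI_j,
    # summed over all training examples by the matrix product.
    syn1 -= epsilon * hidden.T @ dE_dI_out
    syn0 -= epsilon * X.T @ dE_dI_hidden

    if step % 10000 == 0:
        print("Error:", np.mean(np.abs(Y - output)))

print("Output after training")
print(output)
```

Here all four training examples are processed at once as rows of a matrix, so each update sums the derivatives over the whole training set.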

    Error: 0.496410031903
    Error: 0.393897980798
    Error: 0.0286751008083
    Error: 0.00990456559616
    Error: 0.00567429108593
    Error: 0.0037718217354
    Output after training
    [[ 0.00250378]
     [ 0.99688199]
     [ 0.99713139]
     [ 0.00223687]]
