# Artificial Neural Networks

## The Neuron

The image below shows a neuron, also known as a node.

![ann1](ann1.png)

The neuron takes a series of input signals and produces a single output signal. In this course, yellow nodes will signify input values, while green will signify hidden nodes and red will signify output values.

<img src="ann2.png" alt="ann2" width="300" style="float:right">

The values in the input layer correspond to a single observation (a single row in the database), measuring the values of multiple independent variables that have been standardized or normalized. Standardization ensures that each variable has a mean of 0 and a variance of 1. In normalization you subtract the minimum value and divide by the range (the maximum minus the minimum) to get values between 0 and 1.

Whether you choose normalization or standardization depends on the scenario. Scaling the inputs in one of these ways is necessary for the network to train well. Further reading on this can be found in Efficient BackProp by Yann LeCun et al. (1998).
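As a minimal sketch of the two scaling options, assuming a small made-up feature matrix `X` (rows are observations, columns are independent variables), standardization and min-max normalization, taken here as $(x - \min)/(\max - \min)$, could be written with NumPy as follows.

```python
import numpy as np

# Made-up feature matrix: 4 observations (rows) x 2 independent variables (columns).
X = np.array([[1200.0, 2.0],
              [1500.0, 3.0],
              [900.0, 1.0],
              [2000.0, 4.0]])

# Standardization: subtract the column mean, divide by the column standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalization (min-max scaling): subtract the column minimum, divide by the column range.
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_std.mean(axis=0), X_std.var(axis=0))    # approximately 0 and 1 for each column
print(X_norm.min(axis=0), X_norm.max(axis=0))   # 0 and 1 for each column
```

In practice the scaling parameters would be computed on the training data only and then reused on any new data.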

The output value can be continuous, binary, or categorical. If the output is categorical, we can say that the neuron has multiple outputs, corresponding to the dummy variables of each category.

The connecting lines between nodes (the synapses) carry weights $w_1, \dots, w_m$, which are adjusted as the network is trained.

At the neuron, an activation function $\phi$ is applied to the weighted sum of the inputs,

\begin{equation}
    \phi\left(\sum_{i=1}^{m} w_i x_i \right).
\end{equation}

Depending on the function and the outcome, the signal is passed on as output to the next node.
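The computation at a single neuron can be sketched in a few lines. The function name `neuron_output`, the input values and the weights are all illustrative assumptions, and the identity function stands in for $\phi$ only to keep the arithmetic easy to follow.

```python
import numpy as np

def neuron_output(x, w, phi):
    """Apply the activation function phi to the weighted sum of the inputs."""
    return phi(np.dot(w, x))

def identity(z):
    """Placeholder activation that returns the weighted sum unchanged."""
    return z

x = np.array([0.5, -1.0, 2.0])   # illustrative input signals
w = np.array([0.4, 0.3, 0.1])    # illustrative weights
out = neuron_output(x, w, identity)
print(out)                        # 0.4*0.5 + 0.3*(-1.0) + 0.1*2.0, i.e. about 0.1
```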

## The Activation Function

The activation function can take a number of different forms. Below, note that $x$ without a subscript indicates the weighted sum of the inputs.

### Threshold Function

The threshold function or step function, 

\begin{equation}
    \phi(x) = 
    \begin{cases}
        1 \text{ if } x \ge 0, \\
        0 \text{ if } x < 0,
    \end{cases}
\end{equation}

is a very simple function: it outputs 0 if the value is less than 0, and 1 otherwise.

![ann3](ann3.png)
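A possible NumPy version of the threshold function, written as a sketch (the name `threshold` is our own choice):

```python
import numpy as np

def threshold(x):
    """Step function: 1 where x >= 0, otherwise 0."""
    return np.where(x >= 0, 1, 0)

print(threshold(np.array([-2.0, 0.0, 3.5])))   # [0 1 1]
```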


### Sigmoid Function

The sigmoid function or logistic function,

\begin{equation}
    \phi(x) = \frac{1}{1 + e^{-x}},
\end{equation}

is smooth, asymptotically approaching $0$ as $x \to -\infty$ and $1$ as $x \to +\infty$, and passing through $0.5$ at $x = 0$. It's very useful in the final layer of the network, especially when the output is a probability.


![ann4](ann4.png)
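The limiting behaviour of the sigmoid can be checked numerically. This is only a sketch; the sample points are arbitrary.

```python
import numpy as np

def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-10.0, 0.0, 10.0])
print(sigmoid(xs))   # close to 0, exactly 0.5, close to 1
```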

### Rectifier Function 

The rectifier function,

\begin{equation}
    \phi(x) = \max(x,0),
\end{equation}

is one of the most popular functions for ANNs. See Deep Sparse Rectifier Neural Networks by Xavier Glorot et al (2011) for more on why the rectifier function is so widely used.

![ann5.png](ann5.png)
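In code the rectifier is a one-liner. The sketch below uses NumPy's element-wise maximum; the name `rectifier` is our own.

```python
import numpy as np

def rectifier(x):
    """Rectifier function: max(x, 0), applied element-wise."""
    return np.maximum(x, 0)

print(rectifier(np.array([-3.0, 0.0, 2.5])))   # negative inputs are clipped to 0
```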

### Hyperbolic Tangent Function (tanh)

The hyperbolic tangent function,

\begin{equation}
    \phi(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}},
\end{equation}

is similar in shape to the sigmoid function, but it ranges from $-1$ to $1$: it asymptotically approaches $-1$ as $x \to -\infty$ and $1$ as $x \to +\infty$.

![ann6.png](ann6.png)
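As a quick sanity check, the formula above can be compared against NumPy's built-in `np.tanh`. The helper name `tanh_from_formula` is hypothetical.

```python
import numpy as np

def tanh_from_formula(x):
    """Hyperbolic tangent written as (1 - exp(-2x)) / (1 + exp(-2x))."""
    return (1 - np.exp(-2 * x)) / (1 + np.exp(-2 * x))

xs = np.linspace(-3, 3, 7)
print(np.allclose(tanh_from_formula(xs), np.tanh(xs)))   # True
```

Note that this literal form can overflow for very large negative $x$; `np.tanh` is the safer choice in real code.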

### Examples

Assume the dependent variable is binary, $y \in \{0, 1\}$. We could use the threshold function, in which case $y=\phi(x)$. Or we could use the sigmoid function to get the probability, $\text{P}(y=1)=\phi(x)$, similar to logistic regression.

In neural networks, frequently the hidden layers will use rectifier functions, while the output layer will use a logistic function.

![ann8.png](ann8.png)
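To make this combination concrete, here is a sketch of a forward pass through one hidden layer of rectifier neurons followed by a sigmoid output neuron. The layer sizes, the random weights and the 0.5 decision cut-off are all assumptions chosen for illustration; nothing here is trained.

```python
import numpy as np

def rectifier(x):
    return np.maximum(x, 0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                # one scaled observation with 3 inputs (made up)
W_hidden = rng.normal(size=(4, 3))    # weights: 3 inputs -> 4 hidden neurons
w_out = rng.normal(size=4)            # weights: 4 hidden neurons -> 1 output neuron

hidden = rectifier(W_hidden @ x)      # hidden layer uses the rectifier
p = sigmoid(w_out @ hidden)           # output layer gives P(y = 1)
y_pred = int(p >= 0.5)                # optional hard decision with an assumed 0.5 cut-off
print(p, y_pred)
```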


## How do Neural Networks Work?

We're going to examine how NNs work by using a pre-trained example looking at house prices. We'll see how these are actually trained in later sections.


We'll look at an input layer with four independent variables: area in square feet ($x_1$), number of bedrooms ($x_2$), distance to the city centre in miles ($x_3$), and age ($x_4$). We want to estimate the price based on these inputs.

If we have no hidden layers, each input feeds into the output layer with a given weight, and some function is applied to produce the output. For example, the function might just take the sum of these weighted values. This is analogous to multiple linear regression, where the weights correspond to the regression coefficients.

![ann9.png](ann9.png)

Now assume instead that we have a pretrained neural network with a single hidden layer of five neurons. There are synaptic connections linking every node in adjacent layers. However, the weights on certain connections will be negligible, such that, in practice, nodes in the hidden layer respond only to a subset of the input variables.

![ann10.png](ann10.png)

For example, the first hidden neuron may only respond to $x_1$ and $x_3$, the second to $x_2$ and $x_3$, the third to $x_1$, $x_2$ and $x_4$, the fourth to all inputs, and the fifth to $x_4$. These nodes have picked up on particular features of the data. We may theorize on the nature of these linked features using our own human intuition and reasoning for this particular case study.
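One way to picture this is as a weight matrix in which the negligible weights are (approximately) zero. The sketch below uses invented weight values that follow the connectivity described above; the tolerance `1e-6` and the rectifier activation are also assumptions.

```python
import numpy as np

# Rows: hidden neurons 1..5; columns: inputs x1..x4 (area, bedrooms, distance, age).
# All values are invented for illustration; zeros stand in for negligible weights.
W_hidden = np.array([
    [0.8, 0.0, -0.6,  0.0],   # neuron 1: x1 and x3
    [0.0, 0.5, -0.4,  0.0],   # neuron 2: x2 and x3
    [0.3, 0.7,  0.0, -0.2],   # neuron 3: x1, x2 and x4
    [0.2, 0.1, -0.3, -0.1],   # neuron 4: all inputs
    [0.0, 0.0,  0.0, -0.9],   # neuron 5: x4
])

# List, for each hidden neuron, the inputs whose weights are not negligible.
responds = [[f"x{i}" for i, w in enumerate(row, start=1) if abs(w) > 1e-6]
            for row in W_hidden]
for j, inputs in enumerate(responds, start=1):
    print(f"hidden neuron {j} responds to: {', '.join(inputs)}")

x = np.array([0.5, 0.2, -1.0, 0.3])     # one scaled observation (made up)
hidden = np.maximum(W_hidden @ x, 0)    # hidden activations, assuming a rectifier
print(hidden)
```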

## How do Neural Networks Learn?

The neuron, in isolation, is the simplest type of neural network: a single-layer feed-forward NN, or perceptron. When discussing training, we denote the actual measured value by $y$ and the predicted value by $\hat{y}$.

To begin with we look at how training works for a single row of input data. We start with some initial configuration of weights and calculate $\hat{y}$. Once $\hat{y}$ is known, we evaluate the cost function $C(\hat{y}, y)$. We can choose from a number of cost functions, but the simplest is half the squared difference of $y$ and $\hat{y}$,

\begin{equation}
    C = \frac{1}{2}(\hat{y} - y)^2.
\end{equation}

Once $C$ has been calculated, the result is fed back to adjust the weights, with the aim of minimising the cost function through successive iterations.

![ann11.png](ann11.png)

Now let's extend the discussion to multiple rows. We calculate $\hat{y}$ for each row in succession. Then once all rows are calculated we calculate the cost function,

\begin{equation}
    C = \frac{1}{2}\sum_i(\hat{y}_i - y_i)^2.
\end{equation}

This result is then fed back into the NN to adjust the weights. Note that all the rows share the same weights; we're not dealing with a different NN for each row.

![ann12.png](ann12.png)
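The two cost expressions above differ only in whether a single row or all rows are summed, so one small function covers both. The values of `y` and `y_hat` below are made up.

```python
import numpy as np

def cost(y_hat, y):
    """Half the sum of squared differences between predicted and actual values."""
    return 0.5 * np.sum((y_hat - y) ** 2)

y = np.array([1.0, 0.0, 1.0])       # actual values (made up)
y_hat = np.array([0.8, 0.3, 0.6])   # predicted values (made up)

print(cost(y_hat[0], y[0]))   # single row: 0.5 * 0.2**2, about 0.02
print(cost(y_hat, y))         # all rows: 0.5 * (0.04 + 0.09 + 0.16), about 0.145
```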

We now repeat the process with the adjusted weights, for all rows, until $C$ has been minimised. Feeding the error back through the network to adjust the weights is called backpropagation.

[Link to further reading on different cost functions](https://stats.stackexchange.com/questions/154879/a-list-of-cost-functions-used-in-neural-networks-alongside-applications)


## The Curse of Dimensionality

We've discussed that the result of the cost function is fed back to adjust the weights, but we haven't clarified how this is done. 

Let's take again the example of the simple single input perceptron.

![single_input_perceptron.png](single_input_perceptron.png)

Naively, we might first attempt to train it using a brute force approach. In this manner, we could try, say, 1000 different weights and find the one that minimises the cost function.

![brute_force.png](brute_force.png)

This might be suitable when tuning a single weight. However, as the number of weights increases, we run into the Curse of Dimensionality. Let's look at the pre-trained example we dealt with before.

![25_weights](25_weights.png)

In this example there are $25$ weights: $20$ between the input and hidden layers, and $5$ between the hidden and output layers. So, if we want to test $1000$ different values for each weight, that's $1000^{25} = 10^{75}$ different combinations. Testing a single combination requires multiple floating point operations (flop). Let's naively assume, however, that we can reduce this to a single flop. At the time of writing, the world's fastest supercomputer performs at about $93$ Pflop/s, i.e. $93 \times 10^{15}$ flop/s. This means that it'll take

\begin{equation}
    \frac{10^{75} \text{ flop} }{ 93 \times 10^{15} \text{ flop/s}} \approx 10^{58} \text{ s} \approx 3 \times 10^{50} \text{ years}
\end{equation}

to test all combinations; that's longer than the age of the universe. So, how can we overcome this?
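The arithmetic can be reproduced in a few lines, using the same figures as the equation above. The conversion assumes a year of 365.25 days.

```python
combinations = 1000 ** 25                 # 1000 candidate values for each of 25 weights
flops_per_second = 93e15                  # 93 x 10^15 flop/s, as in the equation above
seconds = combinations / flops_per_second
years = seconds / (365.25 * 24 * 3600)    # assuming 365.25 days per year

print(f"{seconds:.1e} s, {years:.1e} years")
```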


## Gradient Descent

The trick to overcoming the curse of dimensionality is gradient descent. Here you start at some random (or best-guess) point in $x$. You calculate the cost function at this point and differentiate it to find the slope. The next point in $x$ is then chosen some distance away in the direction of decreasing $C$: if the slope is negative, you select a higher point in $x$; if positive, you select a lower point in $x$.

![slope](slope.png)

This process is repeated until you find a minimum in $C$.

![1d_descent](1d_descent.png)
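A one-dimensional version of this procedure might look like the sketch below. The quadratic cost, the starting point and the step size `learning_rate` are all assumptions picked so that the minimum (at $x = 3$) is known in advance.

```python
def cost(x):
    return (x - 3.0) ** 2        # illustrative convex cost with its minimum at x = 3

def slope(x):
    return 2.0 * (x - 3.0)       # derivative of the cost with respect to x

x = -4.0                 # starting point
learning_rate = 0.1      # step size (assumed value)
for step in range(100):
    # Move against the slope: negative slope -> increase x, positive slope -> decrease x.
    x -= learning_rate * slope(x)

print(x, cost(x))        # x ends up very close to 3
```

Only 100 evaluations of the slope are needed here, compared with the 1000 trial values of the brute force approach, and the saving grows rapidly with the number of weights.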

In 2d this can be visualised like this.

![2d_descent](2d_descent.png)

And in 3d like this.

![3d_descent](3d_descent.png)


## Stochastic Gradient Descent

In the previous example, the cost function had a convex form; the only minimum was the global minimum, so whichever initial $x$ is chosen, the same minimum will be found. However, the cost function may have a much more complex form, with multiple local minima, such as the one below.

![local_minimum](local_minimum.png)

In this case, gradient descent has settled in a non-optimal local minimum. To deal with such functions, we can make use of stochastic gradient descent (SGD). SGD does not rely on the cost function being convex, and it is less prone to getting stuck in local minima, although it is not guaranteed to find the global minimum. How does the methodology differ?

In batch gradient descent (BGD) (the first method described), the cost function is a function of all the rows, i.e. all the rows are processed each time the cost function is assessed. So, the weights are updated only after all rows are taken into account. In SGD, the cost function is evaluated on a row-by-row basis, and in a randomly chosen order; the weights are updated for each row.

![batch_vs_gradient](batch_vs_gradient.png)

This helps to avoid getting stuck in a local minimum, as the cost fluctuates much more from one update to the next. It is also often faster, since the weights are updated without first processing the whole dataset. BGD is a deterministic method, while SGD is, naturally, stochastic.

For further reading on gradient descent methodology see [A Neural Network in 13 lines of Python (Part 2 - Gradient Descent) by Andrew Trask (2015)](https://iamtrask.github.io/2015/07/27/python-network-part2/).
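The difference in update schedule between the two methods can be sketched on a toy problem. Everything below is assumed for illustration: a single linear neuron, noise-free made-up data, and hand-picked learning rates. The batch version averages the gradient over the rows (rather than summing) purely to keep the step size easy to choose. The example has a convex cost, so it shows only *when* the weights are updated, not the escape from local minima.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))       # made-up inputs: 200 rows, 2 variables
true_w = np.array([2.0, -1.0])      # weights used to generate the targets
y = X @ true_w                      # noise-free targets

def batch_gd(X, y, lr=0.5, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # gradient averaged over all rows
        w -= lr * grad                      # one update per pass through the data
    return w

def stochastic_gd(X, y, lr=0.05, epochs=50, seed=0):
    order_rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in order_rng.permutation(len(y)):   # rows visited in random order
            grad = (X[i] @ w - y[i]) * X[i]       # gradient of 1/2 (y_hat - y)^2 for row i
            w -= lr * grad                        # update after every single row
    return w

print(batch_gd(X, y))        # both should end up close to [2, -1]
print(stochastic_gd(X, y))
```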


## Backpropagation

In the previous sections we saw how information was entered into the input layer and propagated forward to compute $\hat{y}$. 

![forward_propagation](forward_propagation.png)

![back_propagation](back_propagation.png)

The predicted values are then compared with the $y$ of the training set, and the errors are propagated back to adjust the weights.

One might think that each of the weights needs to be adjusted individually. However, due to the mathematics of backpropagation, all the weights can be adjusted at once.

The mathematics of this is covered in the further reading, [Neural Networks and Deep Learning by Michael Nielsen (2015)](http://neuralnetworksanddeeplearning.com/chap2.html).
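As a rough illustration of how a single backward pass yields the gradient for every weight, the sketch below applies the chain rule to a tiny network with two inputs, three sigmoid hidden neurons and one sigmoid output, using the cost $C = \frac{1}{2}(\hat{y} - y)^2$. The architecture, the random weights and all function names are assumptions. One gradient entry is compared against a finite-difference estimate as a check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, w2):
    h = sigmoid(W1 @ x)         # hidden activations
    y_hat = sigmoid(w2 @ h)     # network output
    return h, y_hat

def cost(x, y, W1, w2):
    return 0.5 * (forward(x, W1, w2)[1] - y) ** 2

def backprop(x, y, W1, w2):
    """Gradients of C = 1/2 (y_hat - y)^2 with respect to all weights, in one pass."""
    h, y_hat = forward(x, W1, w2)
    delta_out = (y_hat - y) * y_hat * (1 - y_hat)    # error signal at the output neuron
    grad_w2 = delta_out * h                          # hidden -> output weights
    delta_hidden = delta_out * w2 * h * (1 - h)      # error propagated back to the hidden layer
    grad_W1 = np.outer(delta_hidden, x)              # input -> hidden weights
    return grad_W1, grad_w2

rng = np.random.default_rng(1)
x, y = rng.normal(size=2), 1.0
W1, w2 = rng.normal(size=(3, 2)), rng.normal(size=3)
grad_W1, grad_w2 = backprop(x, y, W1, w2)

# Finite-difference estimate for one weight, to compare with the backpropagated value.
eps = 1e-6
W1_shift = W1.copy()
W1_shift[0, 1] += eps
numeric = (cost(x, y, W1_shift, w2) - cost(x, y, W1, w2)) / eps
print(grad_W1[0, 1], numeric)    # the two numbers should agree closely
```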

## Steps to training ANNs

In summary the steps to train an ANN are:
1. Randomly initialise the weights to small numbers close to 0 (but not 0).
2. Input the first observation of your dataset in the input layer, each feature in one input node.
3. Forward-Propagation: from left to right, the neurons are activated in a way that the impact of each neuron's activation is limited by the weights. Propagate the activations until getting the predicted result $\hat{y}$.
4. Compare the predicted result $\hat{y}$ to the actual result $y$. Measure the generated error.
5. Backpropagation: from right to left, the error is backpropagated. Update the weights according to how much they are responsible for the error. The learning rate decides by how much we update the weights.
6. Repeat steps 2 to 5 and update the weights after each observation (online learning, as in stochastic gradient descent). Or: repeat steps 2 to 5 but update the weights only after a batch of observations (batch learning).
7. When the whole training set has been passed through the ANN, that makes an epoch. Repeat for more epochs.
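The steps above can be sketched as a single training loop. This is a minimal NumPy illustration, not the Keras workflow: the toy dataset, the network size (two inputs, three sigmoid hidden neurons, one sigmoid output), the learning rate and the number of epochs are all assumptions, and the weights are updated after each observation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def total_cost(X, y, W1, w2):
    y_hat = sigmoid(sigmoid(X @ W1.T) @ w2)
    return 0.5 * np.sum((y_hat - y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                  # made-up, already scaled inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float)      # made-up binary target

# Step 1: initialise the weights to small non-zero random numbers.
W1 = rng.uniform(-0.5, 0.5, size=(3, 2))       # input -> hidden
w2 = rng.uniform(-0.5, 0.5, size=3)            # hidden -> output
learning_rate = 0.5                            # assumed value
initial_cost = total_cost(X, y, W1, w2)

for epoch in range(100):                       # Step 7: repeat for several epochs
    for i in rng.permutation(len(y)):          # Step 2: feed in one observation
        h = sigmoid(W1 @ X[i])                 # Step 3: forward propagation
        y_hat = sigmoid(w2 @ h)
        error = y_hat - y[i]                   # Step 4: compare prediction and target
        delta_out = error * y_hat * (1 - y_hat)            # Step 5: backpropagation
        delta_hidden = delta_out * w2 * h * (1 - h)
        w2 -= learning_rate * delta_out * h                # Step 6: update after each observation
        W1 -= learning_rate * np.outer(delta_hidden, X[i])

final_cost = total_cost(X, y, W1, w2)
print(initial_cost, final_cost)                # the cost should have dropped
```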

## Importing the libraries

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
```

## Importing the Dataset

```python
dataset = pd.read_csv('Churn_Modelling.csv')
dataset.head()
```

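|   | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |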


This dataset contains the customer details of a fictional bank. The bank is experiencing an unusually high rate of churn (customer loss). They've taken a sample of 10,000 customers to try to diagnose the problem.

We have installed Keras using

```sh
conda install -c conda-forge keras
```

Keras is a high-level library that wraps TensorFlow and Theano.