# Artificial Neural Networks

## The Neuron

The image below shows a neuron, also known as a node.

![ann1](ann1.png)

The neuron takes a series of input signals and produces a single output signal. In this course, yellow nodes will signify input values, while green will signify hidden nodes and red will signify output values.

<img src="ann2.png" alt="ann2" width="300" style="float:right">

The values in the input layer correspond to a single observation (a single row in the database), measuring the values of multiple independent variables that have been standardized or normalized. Standardization rescales each variable to have a mean of 0 and a variance of 1. In normalization you subtract the minimum value and divide by the range (the maximum minus the minimum) to get values between 0 and 1.

Whether you choose normalization or standardization depends on the scenario, but some form of scaling is necessary to ensure the network works correctly. Further reading on this can be found in Efficient BackProp by Yann LeCun et al. (1998).
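As a minimal sketch of the two scaling options, the snippet below rescales one column of values using only the Python standard library. The function names and the `areas` column are hypothetical, chosen purely for illustration.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and variance 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

def normalize(values):
    """Rescale values to the range [0, 1] (min-max scaling)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical column of house areas in square feet.
areas = [850.0, 1200.0, 1500.0, 2100.0]
print(standardize(areas))
print(normalize(areas))
```

Both functions divide by zero if every value in the column is identical, so a constant column would need special handling.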

The output value can be continuous, binary, or categorical. If the output is categorical, we can say that the neuron has multiple outputs, corresponding to the dummy variables of each category.

The connecting lines between nodes (synapses) carry weights $w_1, \dots, w_m$, which are adjusted as the network is trained.

At the neuron, an activation function $\phi$ is applied to the weighted sum of the inputs,

\begin{equation}
    \phi\left(\sum_{i=1}^{m} w_i x_i \right).
\end{equation}

Depending on the function and the outcome, the signal is passed on as output to the next node.
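The formula above can be sketched in a few lines of Python. The input and weight values are made up for illustration, and the sigmoid used as $\phi$ here is one of the activation functions described later in these notes.

```python
import math

def sigmoid(x):
    """Logistic function, used here as an example activation."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, activation):
    """Apply the activation function to the weighted sum of the inputs."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum)

# Hypothetical scaled inputs and weights.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, 0.1]
output = neuron(inputs, weights, sigmoid)
print(output)
```

Passing the activation in as an argument keeps the neuron itself independent of which function $\phi$ is chosen.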

## The Activation Function

The activation function can take a number of different forms. Below, note that $x$ without a subscript indicates the weighted sum of the inputs.

### Threshold Function

The threshold function or step function, 

\begin{equation}
    \phi(x) = 
    \begin{cases}
        1 & \text{if } x \ge 0, \\
        0 & \text{if } x < 0,
    \end{cases}
\end{equation}

is a very simple function: if the value is less than 0, the function outputs 0; otherwise it outputs 1.

![ann3](ann3.png)


### Sigmoid Function

The sigmoid function or logistic function,

\begin{equation}
    \phi(x) = \frac{1}{1 + e^{-x}},
\end{equation}

is smooth, asymptotically approaching $0$ as $x \to -\infty$ and $1$ as $x \to +\infty$, with $\phi(0) = 0.5$. It's very useful in the final layer of the network, especially when the output is a probability.


![ann4](ann4.png)

### Rectifier Function 

The rectifier function,

\begin{equation}
    \phi(x) = \max(x,0),
\end{equation}

is one of the most popular functions for ANNs. See Deep Sparse Rectifier Neural Networks by Xavier Glorot et al. (2011) for more on why the rectifier function is so widely used.

![ann5.png](ann5.png)

### Hyperbolic Tangent Function (tanh)

The hyperbolic tangent function,

\begin{equation}
    \phi(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}},
\end{equation}

is similar to the sigmoid function, but it asymptotically approaches $-1$ rather than $0$ as $x \to -\infty$, so its output lies between $-1$ and $1$.

![ann6.png](ann6.png)
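The four activation functions above can be written out directly. This is a plain transcription of the formulas using the standard library, with function names chosen for illustration.

```python
import math

def threshold(x):
    """Step function: 1 for x >= 0, otherwise 0."""
    return 1.0 if x >= 0 else 0.0

def sigmoid(x):
    """Logistic function 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def rectifier(x):
    """Rectifier function max(x, 0)."""
    return max(x, 0.0)

def tanh(x):
    """Hyperbolic tangent, written as in the formula above."""
    return (1.0 - math.exp(-2.0 * x)) / (1.0 + math.exp(-2.0 * x))

# Compare the four functions at a few sample points.
for x in (-2.0, 0.0, 2.0):
    print(x, threshold(x), sigmoid(x), rectifier(x), tanh(x))
```

These direct transcriptions call `math.exp` on the negated input, which overflows for very large negative $x$. For the hyperbolic tangent, the built-in `math.tanh` avoids that problem.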

### Examples

Assume the dependent variable is binary, $y \in \{0, 1\}$. We could use the threshold function, in which case $y=\phi(x)$. Or we could use the sigmoid function to get the probability, $\text{P}(y=1)=\phi(x)$, similar to logistic regression.

In neural networks, the hidden layers frequently use rectifier functions, while the output layer uses a sigmoid function.

![ann8.png](ann8.png)
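The combination described above, rectifier functions in the hidden layer and a sigmoid at the output, can be sketched for a binary dependent variable as follows. The layer sizes, weights, and inputs are hypothetical values chosen only to make the example run.

```python
import math

def rectifier(x):
    return max(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def weighted_sum(weights, inputs):
    return sum(w * x for w, x in zip(weights, inputs))

def predict_probability(inputs, hidden_weights, output_weights):
    """Rectifier hidden layer followed by a sigmoid output neuron."""
    hidden = [rectifier(weighted_sum(w, inputs)) for w in hidden_weights]
    return sigmoid(weighted_sum(output_weights, hidden))

# Hypothetical weights: two inputs, two hidden neurons, one output.
hidden_weights = [[0.5, -0.2], [-0.3, 0.8]]
output_weights = [1.0, -1.5]

p = predict_probability([1.0, 2.0], hidden_weights, output_weights)
label = 1 if p >= 0.5 else 0  # same decision as thresholding the weighted sum at 0
print(p, label)
```

Because the sigmoid equals 0.5 at $x = 0$, cutting the probability at 0.5 gives the same label as applying the threshold function to the output neuron's weighted sum.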


## How do Neural Networks Work?

We're going to examine how NNs work by using a pre-trained example looking at house prices. We'll see how these are actually trained in later sections.


We'll look at an input layer with four independent variables: area in square feet ($x_1$), number of bedrooms ($x_2$), distance to the city centre in miles ($x_3$), and age ($x_4$). We want to estimate the price based on these inputs.

If we have no hidden layers, each input feeds into the output layer with a given weight, and some function is applied to produce the output. For example, the function might just take the sum of these weighted values. This is analogous to multiple linear regression, where the weights correspond to the regression coefficients.

![ann9.png](ann9.png)
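With no hidden layer and a plain sum at the output, the prediction is just a weighted sum of the four inputs. The sketch below uses invented weights and an invented house, and feeds in raw rather than scaled values purely to keep the arithmetic readable.

```python
def predict_price(features, weights):
    """Output neuron with no hidden layer: a weighted sum of the inputs."""
    return sum(w * x for w, x in zip(weights, features))

# Hypothetical weights for area, bedrooms, distance, and age.
weights = [150.0, 10000.0, -2000.0, -500.0]

# Hypothetical house: 1200 sq ft, 3 bedrooms, 5 miles out, 20 years old.
house = [1200.0, 3.0, 5.0, 20.0]

price = predict_price(house, weights)
print(price)
```

As in the formula for the neuron, there is no intercept term here; a regression model would normally include one.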

Now assume instead that we have a pre-trained neural network with a single hidden layer of five neurons. There are synaptic connections linking every node in adjacent layers. However, the weights on certain connections will be negligible, such that, in practice, nodes in the hidden layer respond only to a subset of input variables.

![ann10.png](ann10.png)

For example, the first hidden neuron may only respond to $x_1$ and $x_3$, the second to $x_2$ and $x_3$, the third to $x_1$, $x_2$, and $x_4$, the fourth to all inputs, and the fifth to $x_4$. These nodes have picked up on particular features of the data. We may theorize on the nature of these linked features using our own human intuition and reasoning for this particular case study.
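One way to picture this is as a weight table in which the negligible connections are set to zero. The sketch below mirrors the pattern of connections described above, but every number in it is invented, and the rectifier hidden layer and plain-sum output are assumptions made for illustration.

```python
def rectifier(x):
    return max(x, 0.0)

# One row per hidden neuron, one column per input (x1..x4).
# Zeros stand in for negligible weights; all values are hypothetical.
hidden_weights = [
    [0.6, 0.0, -0.4, 0.0],   # responds to x1 and x3
    [0.0, 0.5, -0.3, 0.0],   # responds to x2 and x3
    [0.3, 0.4, 0.0, -0.2],   # responds to x1, x2 and x4
    [0.2, 0.1, -0.1, -0.1],  # responds to all inputs
    [0.0, 0.0, 0.0, -0.7],   # responds to x4 only
]
output_weights = [1.0, 0.8, 0.5, 0.3, 0.9]

def active_inputs(weights, tolerance=1e-6):
    """Return the 1-based indices of inputs with non-negligible weight."""
    return [i + 1 for i, w in enumerate(weights) if abs(w) > tolerance]

def forward(inputs):
    """Rectifier hidden layer, then a weighted sum at the output."""
    hidden = [rectifier(sum(w * x for w, x in zip(row, inputs)))
              for row in hidden_weights]
    return sum(w * h for w, h in zip(output_weights, hidden))

for n, row in enumerate(hidden_weights, start=1):
    print(f"hidden neuron {n} responds to inputs {active_inputs(row)}")

# Hypothetical scaled inputs for one house.
print(forward([1.0, 0.5, -0.5, 0.2]))
```

In a real trained network the negligible weights would be small rather than exactly zero, which is why `active_inputs` compares against a tolerance.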

## How do Neural Networks Learn?

The neuron, in isolation, is the simplest type of neural network. It is a single-layer feed-forward NN, or a perceptron. When we speak of training, the actual measured value is denoted by $y$, and the predicted value by $\hat{y}$.

To begin with, we look at how training works for a single row of input data. We start with some initial configuration of weights and calculate $\hat{y}$. Once $\hat{y}$ is calculated, we calculate the cost function $C(\hat{y}, y)$. We can choose from a number of cost functions, but the simplest is half the squared difference of $y$ and $\hat{y}$,

\begin{equation}
    C = \frac{1}{2}(\hat{y} - y)^2.
\end{equation}

Once $C$ has been calculated, the result is fed back to adjust the weights, with the aim of minimising the cost function through successive iterations.

![ann11.png](ann11.png)
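For a single row, the cost is a one-line calculation. The sketch below also includes the derivative of $C$ with respect to $\hat{y}$, which is not stated in the text above but follows directly from the formula and shows why the factor of one half is convenient. The sample values are arbitrary.

```python
def cost(y_hat, y):
    """Half the squared difference between prediction and actual value."""
    return 0.5 * (y_hat - y) ** 2

def cost_gradient(y_hat, y):
    """Derivative of the cost with respect to y_hat; the 1/2 cancels the 2."""
    return y_hat - y

# Arbitrary example: the prediction undershoots the actual value.
print(cost(3.0, 5.0))
print(cost_gradient(3.0, 5.0))
```

A negative derivative means the cost falls as $\hat{y}$ increases, which is the kind of signal used to decide how to adjust the weights.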

Now let's extend the discussion to multiple rows. Training an NN takes multiple epochs, each of which is one pass over all rows. In the first epoch, we calculate $\hat{y}$ for each row in succession. Then, once all rows are calculated, we calculate the cost function,

\begin{equation}
    C = \frac{1}{2}\sum_i(\hat{y}_i - y_i)^2.
\end{equation}

This result is then fed back into the NN to adjust the weights. All the rows share the same weights; we're not dealing with a different NN for each row.

![ann12.png](ann12.png)

We then repeat the process with the adjusted weights, for all rows, until we've minimised $C$. Feeding the error back through the network to adjust the weights is called backpropagation.
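The loop over epochs can be sketched for a single neuron as follows. The text above does not say how the weights are adjusted, so this example assumes plain gradient descent, an identity activation, and a hand-picked learning rate; the data set is invented so that the true weights are known.

```python
def predict(weights, row):
    """Single neuron with identity activation: a weighted sum."""
    return sum(w * x for w, x in zip(weights, row))

def train(rows, targets, learning_rate=0.1, epochs=200):
    """Assumed update rule: gradient descent on C = 1/2 * sum (y_hat - y)^2."""
    weights = [0.0] * len(rows[0])
    history = []
    for _ in range(epochs):
        # Predict every row with the same shared weights.
        errors = [predict(weights, r) - y for r, y in zip(rows, targets)]
        history.append(0.5 * sum(e ** 2 for e in errors))
        # Gradient of C for weight j: sum over rows of error * input j.
        gradients = [sum(e * r[j] for e, r in zip(errors, rows))
                     for j in range(len(weights))]
        # Feed the result back: move each weight against its gradient.
        weights = [w - learning_rate * g for w, g in zip(weights, gradients)]
    return weights, history

# Invented data generated from the weights [2, -1].
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.25]]
targets = [2.0, -1.0, 1.0, 0.75]

weights, history = train(rows, targets)
print(weights)
print(history[0], history[-1])
```

The `history` list records the cost at the start of each epoch, so it shows $C$ shrinking as the shared weights are adjusted.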

[Link to further reading on different cost functions](https://stats.stackexchange.com/questions/154879/a-list-of-cost-functions-used-in-neural-networks-alongside-applications)
