# Neural Networks

## Motivation

Artificial neural networks are models that loosely imitate a biological neural network, as illustrated in the following figure:

![Neural Network](../images/neural_networks.png)

Neural networks have become popular over the last decade because of the wide variety of tasks they can address, beyond regression and classification. Another reason is that computers are more powerful than they were a few decades ago, which means we can work with more data (not only tabular data) and train more complex models.

![NN Cat](../images/cat_nn.png)

## Example

Consider a set of labeled points, with two categories (A and B). The goal is to construct a mapping from $\mathbb{R}^2$ to $\{A, B\}$.

![Example data](../images/example_dataset_nn.png)

The artificial neural network approach uses repeated application of a simple, nonlinear function. We will base our network on the sigmoid function:

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

We may regard $\sigma(x)$ as a smoothed version of a step function, which itself mimics the behavior of a neuron in the brain.
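As a quick illustration of the formula above, here is a minimal sketch of the sigmoid in plain Python. The function name `sigmoid` and the two-branch form (used only to avoid overflow for large negative inputs) are choices made for this example, not something prescribed by the text.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + exp(-x)), arranged to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0 here, so exp(x) is at most 1
    return z / (1.0 + z)

# The values move smoothly from near 0 to near 1, passing through 0.5 at x = 0.
for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"sigma({x:+.1f}) = {sigmoid(x):.4f}")
```

The printed values show the smoothed-step behavior: close to 0 for very negative inputs, close to 1 for very positive inputs.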

The steepness and location of the transition in the sigmoid function may be altered by scaling and shifting the argument or, in the language of neural networks, by _weighting_ and _biasing_ the input. If $a$ is the vector of outputs produced by the neurons in one layer, then the vector of outputs from the next layer has the form

$$
\sigma \left( W a + b\right)
$$

where $W$ is the matrix of weights and $b$ is the vector of _biases_, and $\sigma$ is applied componentwise. The number of columns in $W$ matches the number of neurons that produced the vector $a$ at the previous layer. The number of rows in $W$ matches the number of neurons at the current layer. The number of components in $b$ also matches the number of neurons at the current layer.
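The shape rules above can be checked with a small sketch of a single layer. The helper name `layer` and the numeric values of `W`, `b` and `a` are hypothetical, chosen only so that the result is easy to verify by hand; the weight matrix is stored as a list of rows.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    return 1.0 / (1.0 + math.exp(-x)) if x >= 0 else math.exp(x) / (1.0 + math.exp(x))

def layer(W, b, a):
    """Return sigma(W a + b), with sigma applied componentwise."""
    # Columns of W must match the number of neurons in the previous layer.
    assert all(len(row) == len(a) for row in W), "columns of W must equal len(a)"
    # Rows of W and components of b must match the current layer.
    assert len(W) == len(b), "rows of W must equal len(b)"
    return [sigmoid(sum(w * v for w, v in zip(row, a)) + bias)
            for row, bias in zip(W, b)]

# Hypothetical example: 2 neurons in the previous layer, 3 in the current one.
W = [[1.0, -1.0],
     [0.5, 0.5],
     [0.0, 2.0]]
b = [0.0, -1.0, 1.0]
a = [1.0, 1.0]

out = layer(W, b, a)
print(len(out), out)  # three outputs, one per neuron in the current layer
```

Here the first two pre-activations are 0, so the first two outputs are exactly 0.5; the third pre-activation is 3.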

Let's consider the following example of an artificial neural network with four layers:

![Example NN with four layers](../images/nn_four_layers.png)

Since the input data has the form $x \in \mathbb{R}^2$, the weights and biases for the second layer may be represented by a matrix $W^{[2]} \in \mathbb{R}^{2 \times 2}$ and a vector $b^{[2]} \in \mathbb{R}^2$, respectively. The output from the second layer has the form

$$
a^{[2]} = \sigma \left( W^{[2]} x + b^{[2]} \right) \in \mathbb{R}^2
$$

The third layer has three neurons, so $W^{[3]} \in \mathbb{R}^{3 \times 2}$ and $b^{[3]} \in \mathbb{R}^3$, and the output from the third layer is

$$
a^{[3]} = \sigma \left( W^{[3]} a^{[2]} + b^{[3]} \right) =  \sigma\left(  W^{[3]} \sigma \left( W^{[2]} x + b^{[2]} \right)  + b^{[3]} \right)\in \mathbb{R}^3
$$

Finally, for the fourth (output) layer $W^{[4]} \in \mathbb{R}^{2 \times 3}$ and $b^{[4]} \in \mathbb{R}^2$. The output of the overall network has the form

$$
\begin{aligned}
F(x)
&= \sigma \left( W^{[4]} a^{[3]} + b^{[4]} \right) \\
&= \sigma \left( W^{[4]} \sigma \left( W^{[3]} a^{[2]} + b^{[3]} \right)  + b^{[4]} \right) \\
&= \sigma \left( W^{[4]} \sigma \left( W^{[3]} \sigma \left( W^{[2]} x + b^{[2]} \right) + b^{[3]} \right)  + b^{[4]} \right) \in \mathbb{R}^2
\end{aligned}
$$

Note that the input layer must have two neurons because each input has two features. The output layer also has two neurons, in this case because there are two categories.
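The composition $F(x)$ for this four-layer network can be sketched in a few lines. The weights below are random placeholders (the network is untrained), and the names `layer`, `random_params` and `F` as well as the Gaussian initialization are assumptions made for this illustration.

```python
import math
import random

def sigmoid(x):
    # Numerically stable logistic function.
    return 1.0 / (1.0 + math.exp(-x)) if x >= 0 else math.exp(x) / (1.0 + math.exp(x))

def layer(W, b, a):
    # One layer: sigma(W a + b), with W stored as a list of rows.
    return [sigmoid(sum(w * v for w, v in zip(row, a)) + bias)
            for row, bias in zip(W, b)]

def random_params(n_out, n_in, rng):
    # Placeholder (untrained) weights and biases drawn from a Gaussian.
    W = [[rng.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    b = [rng.gauss(0.0, 0.5) for _ in range(n_out)]
    return W, b

rng = random.Random(0)               # fixed seed so the run is repeatable
W2, b2 = random_params(2, 2, rng)    # layer 2: R^2 -> R^2
W3, b3 = random_params(3, 2, rng)    # layer 3: R^2 -> R^3
W4, b4 = random_params(2, 3, rng)    # layer 4: R^3 -> R^2

def F(x):
    a2 = layer(W2, b2, x)            # output of the second layer
    a3 = layer(W3, b3, a2)           # output of the third layer
    return layer(W4, b4, a3)         # output of the network

print(F([0.3, 0.7]))                 # two values, each strictly between 0 and 1
```

Because the weights are random, the output carries no meaning yet; the sketch only shows how the dimensions chain together.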


This neural network defines a function $F: \mathbb{R}^2 \to \mathbb{R}^2$ in terms of its 23 parameters (the entries in the weight matrices and bias vectors). Without loss of generality, we can encode the categories as vectors,

$$
A : \begin{pmatrix} 1 \\ 0 \end{pmatrix}
\quad
\text{and}
\quad
B : \begin{pmatrix} 0 \\ 1 \end{pmatrix}
$$

We need to optimize over the 23 parameters in order to classify the inputs into categories A or B.
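The count of 23 parameters and the vector encoding of the categories can be verified with a short sketch. The names `count_parameters`, `ENCODING` and `predict_label` are hypothetical. The decision rule in `predict_label` (pick the category whose output component is larger) is an assumption for this example; the text does not fix a rule.

```python
layer_sizes = [2, 2, 3, 2]  # neurons per layer in the example network

def count_parameters(sizes):
    # Each layer after the input has an (n_out x n_in) weight matrix
    # and a bias vector with n_out components.
    return sum(n_out * n_in + n_out for n_in, n_out in zip(sizes, sizes[1:]))

print(count_parameters(layer_sizes))  # 23

# Categories encoded as the vectors given above.
ENCODING = {"A": [1.0, 0.0], "B": [0.0, 1.0]}

def predict_label(output):
    # Assumed rule: choose the category with the larger output component.
    return "A" if output[0] >= output[1] else "B"

print(predict_label([0.9, 0.2]))  # A
print(predict_label([0.1, 0.7]))  # B
```

The total breaks down as 6 parameters for the second layer, 9 for the third and 8 for the fourth.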

## General Setup

The layers in between the input and output layer are called hidden layers. There is no special meaning to this phrase; it simply indicates that these neurons are performing intermediate calculations. __Deep Learning__ is a loosely defined term which implies that many hidden layers are being used.

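The general setup considers $L$ layers, with layers 1 and $L$ being the input and output layers, respectively. Suppose that layer $l$, for $l=1, 2, \ldots, L$, contains $n_l$ neurons, so $n_1$ is the dimension of the input data. Then $W^{[l]} \in \mathbb{R}^{n_l \times n_{l-1}}$ and $b^{[l]} \in \mathbb{R}^{n_l}$, and the network maps an input $x \in \mathbb{R}^{n_1}$ to the output $a^{[L]} \in \mathbb{R}^{n_L}$ as follows: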

$$
\begin{aligned}
a^{[1]} &= x \\
a^{[l]} &= \sigma \left( W^{[l]} a^{[l-1]} + b^{[l]} \right) \quad \text{for } l=2, 3, \ldots, L.
\end{aligned}
$$
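The recursion above translates directly into a loop over layers. The following is a minimal sketch for arbitrary layer sizes $n_1, \ldots, n_L$; the function names `init_network` and `forward`, the example sizes and the Gaussian initialization are assumptions made for this illustration.

```python
import math
import random

def sigmoid(x):
    # Numerically stable logistic function.
    return 1.0 / (1.0 + math.exp(-x)) if x >= 0 else math.exp(x) / (1.0 + math.exp(x))

def init_network(sizes, seed=0):
    """Create placeholder weights W^[l] (n_l x n_{l-1}) and biases b^[l] (n_l)."""
    rng = random.Random(seed)
    weights, biases = [], []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights.append([[rng.gauss(0.0, 0.5) for _ in range(n_in)]
                        for _ in range(n_out)])
        biases.append([rng.gauss(0.0, 0.5) for _ in range(n_out)])
    return weights, biases

def forward(weights, biases, x):
    """Apply a^[1] = x, then a^[l] = sigma(W^[l] a^[l-1] + b^[l]) for l = 2..L."""
    a = list(x)
    for W, b in zip(weights, biases):
        a = [sigmoid(sum(w * v for w, v in zip(row, a)) + bias)
             for row, bias in zip(W, b)]
    return a

sizes = [4, 5, 3, 2]                  # hypothetical n_1, ..., n_L with L = 4
weights, biases = init_network(sizes)
output = forward(weights, biases, [0.1, 0.2, 0.3, 0.4])
print(len(output), output)            # n_L = 2 output values
```

Only the list `sizes` changes when layers are added or resized; the loop itself stays the same.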

Now suppose we have $N$ samples of training data,
$$
\left\{ x^{\{i\}} \right\}_{i=1}^N \subset \mathbb{R}^{n_1}
$$
for which there are given target outputs
$$
\left\{ y^{\{i\}} \right\}_{i=1}^N \subset \mathbb{R}^{n_L}
$$

## Optimization

The quadratic cost function that we wish to minimize has the form

$$
J\left(W, b\right) = \frac{1}{N} \sum_{i=1}^N \frac{1}{2} \left\lVert y^{\{i\}} - a^{[L]}\left( x^{\{i\}} \right) \right\rVert^2_2
$$

where $W$ and $b$ denote the sets of all the weight matrices and bias vectors, respectively, and $a^{[L]}\left( x^{\{i\}} \right)$ is the output of the network for the input $x^{\{i\}}$.
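To make the cost concrete, here is a minimal sketch that evaluates it for a small network. The names `forward` and `quadratic_cost` are hypothetical. All weights and biases are set to zero purely so that the result can be checked by hand: every output component is then $\sigma(0) = 0.5$.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    return 1.0 / (1.0 + math.exp(-x)) if x >= 0 else math.exp(x) / (1.0 + math.exp(x))

def forward(weights, biases, x):
    # a^[1] = x, then a^[l] = sigma(W^[l] a^[l-1] + b^[l]) for each layer.
    a = list(x)
    for W, b in zip(weights, biases):
        a = [sigmoid(sum(w * v for w, v in zip(row, a)) + bias)
             for row, bias in zip(W, b)]
    return a

def quadratic_cost(weights, biases, xs, ys):
    """Mean over samples of 0.5 * ||y - a^[L](x)||_2^2."""
    total = 0.0
    for x, y in zip(xs, ys):
        out = forward(weights, biases, x)
        total += 0.5 * sum((yi - oi) ** 2 for yi, oi in zip(y, out))
    return total / len(xs)

# Zero-initialized 2-2-3-2 network (an assumption chosen for easy checking).
sizes = [2, 2, 3, 2]
weights = [[[0.0] * n_in for _ in range(n_out)] for n_in, n_out in zip(sizes, sizes[1:])]
biases = [[0.0] * n_out for n_out in sizes[1:]]

xs = [[0.1, 0.9], [0.8, 0.2]]   # two hypothetical training points
ys = [[1.0, 0.0], [0.0, 1.0]]   # targets: category A, then category B

print(quadratic_cost(weights, biases, xs, ys))  # 0.25
```

Each sample contributes $\frac{1}{2}(0.5^2 + 0.5^2) = 0.25$, so the mean is 0.25. Training would adjust the weights and biases to push this value down.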