### Deep Feedforward Networks

**Deep feedforward networks** (also called **feedforward neural networks** or **multilayer perceptrons** (MLPs)) are the quintessential deep learning models.
The goal of a feedforward network is to approximate some function $f^*$. 

Example: 

For a classifier, $y = f^*(x)$ maps an input $x$ to a category $y$. A feedforward network defines a mapping $y = f(x;\theta)$ and learns
the value of the parameters $\theta$ that results in the best function approximation.

These models are called **feedforward** because there are no **feedback** connections in which outputs of the model are fed back into itself.
When feedforward neural networks are extended to include feedback connections, they are called **recurrent neural networks**. For example,
the convolutional networks used for object recognition from photos in commercial applications are a specialized kind of feedforward network. Feedforward neural networks are called
**networks** because they are typically represented by composing together many different functions. The model is associated with a directed acyclic graph describing how the functions are composed.

Example:

Suppose $f^{(1)}$ (the **first layer**), $f^{(2)}$ (the **second layer**), and $f^{(3)}$ (the **third layer**) are connected in a chain to form $f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$. The length of the chain gives the **depth** of the model, here 3.
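
The chain structure can be sketched in Python as plain function composition. The three layer functions below are arbitrary placeholders chosen only for illustration (they do not come from the text); only the way they are composed mirrors $f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$.

```python
from functools import reduce

# Hypothetical layer functions, chosen only for illustration.
def f1(x):
    return 2 * x        # first layer

def f2(x):
    return x + 1        # second layer

def f3(x):
    return x ** 2       # third layer

def compose(*layers):
    """Chain layers so that the first one listed is applied first."""
    return lambda x: reduce(lambda value, layer: layer(value), layers, x)

layers = [f1, f2, f3]
f = compose(*layers)
depth = len(layers)     # the length of the chain is the depth of the model

print(f(3))   # f3(f2(f1(3))) = (2*3 + 1)**2 = 49
print(depth)  # 3
```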


These chain structures are the most commonly used structures of neural networks. The final layer of a feedforward network is called the **output layer**. During neural network training, we drive
$f(x)$ to match $f^*(x)$.

Example:

Each training example $x$ is accompanied by a label $y \approx f^*(x)$. The training examples specify directly what the output layer must do at each point $x$: it must produce a value close to $y$. The behavior of the other layers
is not directly specified by the training data, so the learning algorithm must decide how to use these layers to best implement an approximation of $f^*$. Because
the training data do not show the desired output for each of these layers, they are called **hidden layers**.

Finally, these networks are called *neural* because they are loosely inspired by neuroscience. Each hidden layer of the network is typically vector valued. The dimensionality of
these hidden layers determines the **width** of the model.

Training a feedforward network requires making many of the same design decisions as are necessary for a linear model: choosing the optimizer, the cost function, and the form of the output
units. Feedforward networks also introduce the concept of a hidden layer, which requires us to choose the **activation functions** that will be used to compute the hidden layer values.
We must also design the architecture of the network, including how many layers it should contain, how these layers should be connected, and how many **units** should be in each layer.

# Example: Learning XOR

The XOR (exclusive or) function provides the target function $y = f^*(x)$ that we want to learn. Our model provides a function
$y = f(x; \theta)$, and our learning algorithm will adapt the parameters $\theta$ to make $f$ as similar as possible to $f^*$. We
want our network to perform correctly on the four points $X = \{ [0,0]^T, [0,1]^T, [1,0]^T, [1,1]^T\}$.

We can treat this problem as a regression problem and use a mean squared error (MSE) loss function. Note that in practice, MSE is usually not an appropriate cost function for modeling binary data.

Evaluated on our whole training set, the MSE loss function is 
$$
J(\theta) = \frac{1}{4}\sum_{x\in X} (f^*(x) - f(x; \theta))^2
$$
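
As a minimal sketch, the loss above can be evaluated with NumPy for any candidate model. The helper name `mse` and the constant predictor `constant_half` are assumptions introduced here for illustration; they are not part of the original text.

```python
import numpy as np

# The four XOR inputs and their targets f*(x).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_true = np.array([0, 1, 1, 0], dtype=float)

def mse(predict, X, y_true):
    """Mean squared error of `predict` over the whole training set."""
    y_pred = np.array([predict(x) for x in X])
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical candidate model: always predict 0.5.
def constant_half(x):
    return 0.5

print(mse(constant_half, X, y_true))  # 0.25
```

Every prediction of this constant model is off by exactly 0.5, so each squared error is 0.25 and so is their mean.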

Now we must choose the form of our model, $f(x; \theta)$. Suppose we choose a linear model, with $\theta$ consisting of $w$ and $b$. Our model is then defined to be

$$
f(x; w, b) = x^Tw + b
$$

We can minimize $J(\theta)$ in closed form with respect to $w$ and $b$ using the normal equations.

After solving the normal equations, we obtain $w = 0$ and $b = \frac{1}{2}$. The linear model simply outputs 0.5 everywhere.
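
This result can be checked numerically. The sketch below assumes the usual trick of appending a column of ones to the inputs so that $b$ is treated as one more weight; the names `X_aug` and `theta` are introduced here for illustration.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Append a column of ones so that the bias b is learned as an extra weight.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Solve the normal equations (X_aug^T X_aug) theta = X_aug^T y.
theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
w, b = theta[:2], theta[2]

print(w, b)        # w is approximately [0, 0], b approximately 0.5
print(X @ w + b)   # the fitted model predicts about 0.5 at every point
```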

![title](img/picture.png)

Solving the XOR problem by learning a representation. The bold numbers printed on the plot indicate the value that the learned function must output at each
point. (left) A linear model applied directly to the original input cannot implement the XOR function. When $x_1 = 0$, the model's output must increase as $x_2$ increases.
When $x_1 = 1$, the model's output must decrease as $x_2$ increases. A linear model must apply a fixed coefficient $w_2$ to $x_2$. The linear model therefore cannot use the value of
$x_1$ to change the coefficient on $x_2$ and cannot solve the problem. (right) In the transformed space represented by the features extracted by a neural network, a linear model can now solve
the problem.

The image above shows how a linear model is not able to represent the XOR function. One way to solve this problem is to use a model that learns a different
feature space in which a linear model is able to represent the solution.

Specifically, we will introduce a simple feedforward network with one hidden layer containing two hidden units (see the image below). This feedforward network has a vector of hidden units $h$
that are computed by a function $f^{(1)}(x;W,c)$. The values of these hidden units are then used as the input for a second layer, which is the output layer of the network. The output layer is still a linear regression model, but it is now applied
to $h$ rather than to $x$. Thus $h = f^{(1)}(x;W,c)$ and $y = f^{(2)}(h;w,b)$, with the complete model being $f(x; W, c, w, b) = f^{(2)}(f^{(1)}(x))$.

![title](img/picture2.png)

An example of a feedforward network, drawn in two different styles. Specifically, this is the feedforward network we use to solve the XOR example. It has a single hidden layer
containing two units. (left) In this style, we draw every unit as a node in the graph. This style is explicit and unambiguous, but for networks larger than this example,
it can consume too much space. (right) In this style, we draw a node in the graph for each entire vector representing a layer's activations.

If $f^{(1)}$ were linear, the network as a whole would remain a linear function of its input, so $f^{(1)}$ must be nonlinear. Most neural networks achieve this using an affine transformation controlled by learned parameters, followed by a fixed nonlinear function called an activation function. Hence
$h = g(W^Tx + c)$, where $W$ provides the weights of a linear transformation and $c$ the biases. The activation function $g$ is typically chosen to be a function that is applied element-wise,
with $h_i = g(x^TW_{:,i} + c_i)$. In modern neural networks, the default recommendation is to use the **rectified linear unit**, or ReLU, defined by the activation function
$g(z) = \max\{0,z\}$.

![title](img/picture3.png)

The rectified linear activation function. This activation function is the default activation function recommended for use with most feedforward neural networks.
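
A minimal NumPy sketch of this activation function follows. The function name `relu` and the sample vector `z` are assumptions made for illustration.

```python
import numpy as np

def relu(z):
    """Rectified linear unit, applied element-wise: max(0, z)."""
    return np.maximum(0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))  # negative entries become 0, non-negative entries pass through
```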

Now we can specify our complete network as
$$
f(x; W, c, w, b) = w^T\max\{ 0, W^Tx + c \} + b
$$

### Solution to the XOR problem

We can then specify a solution to the XOR problem. Let
$$
W = 
\begin{bmatrix}
1 & 1 \\
1 & 1
\end{bmatrix}
$$
$$
c = 
\begin{bmatrix}
0  \\
-1 
\end{bmatrix}
$$
$$
w = 
\begin{bmatrix}
1  \\
-2 
\end{bmatrix}
$$

and $b = 0$.

Now let $X$ be the design matrix containing all four points, with one example per row:
$$
X = 
\begin{bmatrix}
0 & 0 \\
0 & 1 \\ 
1 & 0 \\
1 & 1 \\ 
\end{bmatrix}
$$

The first step in the neural network is to multiply the input matrix by the first layer's weight matrix:

$$
XW = 
\begin{bmatrix}
0 & 0 \\
1 & 1 \\ 
1 & 1 \\
2 & 2 \\ 
\end{bmatrix}
$$

Next, we add the bias vector $c$ to each row, to obtain

$$
\begin{bmatrix}
0 & -1 \\
1 & 0 \\ 
1 & 0 \\
2 & 1 \\ 
\end{bmatrix}
$$

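To finish computing $h$ for each example, we apply the rectified linear transformation element-wise:

$$
\begin{bmatrix}
0 & 0 \\
1 & 0 \\
1 & 0 \\
2 & 1 \\
\end{bmatrix}
$$

As shown in the first image above, the examples now lie in a space where a linear model can solve the problem.
We finish by multiplying by the weight vector $w$: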

$$
\begin{bmatrix}
0 \\
1 \\ 
1 \\
0 \\ 
\end{bmatrix}
$$

The neural network has obtained the correct answer for every example in the batch.
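
The whole computation can be reproduced in a few lines of NumPy. This is a sketch under the assumptions that $W$, $c$, and $b$ are as given above, that the output weights are $w = [1, -2]^T$, and that the rectified linear unit is applied element-wise between the two layers; the variable names are chosen here for illustration.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # one example per row
W = np.array([[1, 1], [1, 1]], dtype=float)   # first-layer weights
c = np.array([0, -1], dtype=float)            # first-layer biases
w = np.array([1, -2], dtype=float)            # output-layer weights
b = 0.0                                       # output-layer bias

pre_activation = X @ W + c            # affine transformation; c is broadcast over rows
h = np.maximum(0, pre_activation)     # ReLU, applied element-wise
y = h @ w + b                         # linear output layer

print(y)  # [0. 1. 1. 0.]
```

Skipping the ReLU step and applying $w$ directly to `pre_activation` would give `[2, 1, 1, 0]` instead, which is why the nonlinearity is essential here.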