
The last notebook introduced the perceptron algorithm. While the perceptron is rarely practical on its own, its core idea carries over to other neural network algorithms.

This notebook introduces the multi-layer Perceptron and shows how adding hidden layers to the Perceptron makes the algorithm more capable. The notebook is organized as follows:

1) The need for a multi-layer perceptron <br>
2) Basic structure of a multi-layer perceptron <br>
3) Components of a multi-layer perceptron <br>
4) Description of weights <br> 

Unlike the last notebook, this notebook focuses on the intuition behind the multi-layer perceptron (which we shall refer to as MLP) without any end-to-end practical examples (these will come in later notebooks). We need to understand many pieces of the MLP before we can look at a practical example, and we will go over those pieces throughout the next few notebooks.

This notebook will draw heavily from the Perceptron notebook in terms of examples and notation. 

## The need for a multi-layered perceptron

The perceptron is an algorithm that takes each row of data from a dataset and performs a weighted sum on it and uses an activation function to assign a class to each row. The perceptron is a linear classifier; the decision boundary that it generates is a straight line that divides the feature space into two regions. 
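As a reminder of what that looks like in code, here is a minimal sketch of a perceptron prediction for a single row. The step activation, the weights, the bias, and the two rows are all assumptions chosen for illustration; they are not taken from the Perceptron notebook.

```python
def step(z):
    """Step activation: class 1 if the weighted sum is non-negative, else class 0."""
    return 1 if z >= 0 else 0


def perceptron_predict(row, weights, bias):
    """Weighted sum of one row of features, followed by the activation function."""
    z = sum(w * x for w, x in zip(weights, row)) + bias
    return step(z)


# Hypothetical weights and rows, chosen only for illustration.
weights = [1.0, -1.0]
bias = 0.5

print(perceptron_predict([2.0, 1.0], weights, bias))  # weighted sum is 1.5
print(perceptron_predict([0.0, 3.0], weights, bias))  # weighted sum is -2.5
```

The sign of the weighted sum decides the class, which is why the boundary between the two classes is a straight line.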


<p>
    <img src="perceptron_plus_boundary.jpg" width=800 alt>
    <b>Figure 1</b>
</p>

The perceptron generates a linear decision boundary like the one shown in the bottom left panel of Figure 1. Frequently, this type of decision boundary is insufficient to classify the dataset. Below are three examples where a linear decision boundary does not work. 

<p>
    <img src="nonlinear_three.jpg" width=1000 alt>
    <b>Figure 2</b>
</p>

The three examples above depict plausible spreads of data in real datasets. In the first case in figure 2, a circular decision boundary is required to separate the two classes. In the second case, a linear boundary may work, but the accuracy will be low because of how the data is distributed. In the third case, two lines may be able to separate the two classes but might not provide good accuracy. In all three cases, a single linear boundary fails to separate the two classes well. How can one generate a decision boundary for such scenarios?
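To make the first case concrete, the sketch below uses a tiny made-up dataset: one point of class 1 at the origin, surrounded by four points of class 0. It is not the data behind figure 2. The code tries every straight line on a coarse grid of weights and biases and compares the best result with a circular rule of radius 1 (also an assumption).

```python
import itertools

# Hypothetical toy data: one class-1 point surrounded by four class-0 points.
points = [(0.0, 0.0), (2.0, 0.0), (-2.0, 0.0), (0.0, 2.0), (0.0, -2.0)]
labels = [1, 0, 0, 0, 0]


def linear_rule(p, w1, w2, b):
    """Class 1 on one side of the line w1*x + w2*y + b = 0, class 0 on the other."""
    return 1 if w1 * p[0] + w2 * p[1] + b >= 0 else 0


def circular_rule(p, radius=1.0):
    """Class 1 inside a circle centred at the origin, class 0 outside."""
    return 1 if p[0] ** 2 + p[1] ** 2 < radius ** 2 else 0


def accuracy(predictions):
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return correct / len(labels)


# Try every straight line on a coarse grid of weights and biases.
grid = [i / 2 for i in range(-10, 11)]  # -5.0, -4.5, ..., 5.0
best_linear = max(
    accuracy([linear_rule(p, w1, w2, b) for p in points])
    for w1, w2, b in itertools.product(grid, repeat=3)
)
circle_accuracy = accuracy([circular_rule(p) for p in points])

print(best_linear)      # best straight line found on the grid
print(circle_accuracy)  # circular boundary of radius 1
```

No line can classify all five points correctly, because the class-1 point sits in the middle of the class-0 points, while the circular rule separates them without error.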

## Basic structure of a multilayer perceptron

A better prediction model can be built by extending the perceptron with a "hidden layer" between the input layer and the output layer. 

 It has been mathematically shown that by introducing a hidden layer, a neural network can approximate a wide variety of functions [1]. Adding a hidden layer introduces extra weight parameters in the neural network. Take a look at the figure below. 
 
 <p>
    <img src="mlp_simple.jpg" width=800 alt>
    <b>Figure 3</b>
</p>

In figure 3, a single hidden layer is added between the input layer and the output layer. Each line that connects two nodes represents a weight parameter. Before adding the hidden layer, we had two weight parameters $\text{w}_1$ and $\text{w}_2$.
Adding the hidden layer introduces four new weight parameters. Each new weight parameter is represented by the line that connects the input layer nodes to the hidden layer nodes. Each node in the input layer connects to all the nodes in the hidden layer.  Another way to look at it is that each node in the hidden layer takes information from each of the input layer nodes. 
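The "every input node connects to every hidden node" rule can be sketched by listing the connections explicitly. The node names below are placeholders, not labels from figure 3.

```python
from itertools import product

# Placeholder node names for a two-node input layer and a two-node hidden layer.
input_nodes = ["feature 1", "feature 2"]
hidden_nodes = ["hidden 1", "hidden 2"]

# Fully connected: every input node pairs with every hidden node.
connections = list(product(input_nodes, hidden_nodes))

for start, end in connections:
    print(f"{start} -> {end}")

print(len(connections))  # 2 input nodes x 2 hidden nodes = 4 weights
```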

Similar to the Perceptron architecture, learning the weights means that the neural network has "learned" how to identify the two different classes in the dataset. The procedure is similar to the perceptron: we initialize the weights, then take a row of data and check the network's prediction. Based on the error in the prediction, we adjust the weights. Rather than the three weights we learned last time (including the bias), we now have to learn 3 + 4 (the number of connections between input nodes and hidden layer nodes) + 1 (the new bias term) = 8 weights. 

The details of training an MLP differ from those of a single-layer Perceptron. To follow them, we need to understand each of the elements of the MLP. These are:

1) Weights <br>
2) Activation functions  <br>
3) Loss functions  <br>
4) Gradient descent   <br>
5) Backpropagation  <br>

The addition of a hidden layer, and the possibility of adding more hidden layers, means that we need to incorporate that information when defining weights and input parameters. The MLP can use many types of activation functions. We will introduce the sigmoid and the ReLU, which are commonly used for classification and regression tasks. The results from the network are fed into a loss function that quantifies the error in the result. The Perceptron algorithm had a loss function too; we skipped it in the interest of keeping things simple. However, the loss function is essential to understanding the MLP, since there are multiple task-dependent loss functions to choose from, depending on whether the task is classification or regression. Once the loss has been calculated, gradient descent and backpropagation are used to adjust the weights of the neural network to produce better results. Backpropagation was crucial in bringing MLPs to the forefront, since it led to significant improvements in prediction capabilities. 

All of this will not fit in a single notebook, so the content is divided into multiple notebooks. This notebook covers part 1: Weights. The second notebook will cover activation functions and loss functions, and the third notebook will tackle gradient descent and backpropagation. In the fourth notebook, we will combine all these ideas and provide practical examples of training a multi-layer perceptron using the MNIST dataset. 

The first part we will tackle is how to define weights in a multi-layer perceptron. 




### Defining weights in an MLP


Defining weights for the Perceptron was straightforward. We had two input nodes, a bias term, and an output node. The two input nodes and the bias term were connected to the output node, so our weighted sum had three terms: one for each feature in our dataset, plus a bias term. Life was easy. Adding a hidden layer changes not only the number of weights but also the number of weighted sums that we must write down. 

To write down the weighted sums, let us start by looking at the connections between the input layer and the hidden layer. We will call the input layer layer 0 and the hidden layer layer 1. We will define some notation to identify the weights, as shown in figure 4:
 
 <p>
    <img src="weightdef.jpg" width=800 alt>
    <b>Figure 4</b>
</p> 

It is useful to have an upper index, which refers to the higher of the two layer numbers that the connection joins (the layer where the connection ends), and two lower indices, which refer to the starting node and the ending node of the connection [2]. Another example would be:

$$ \text{w}^{1}_{22} \rightarrow \text{The connection from the second node of layer 0 to the second node of layer 1}  $$ 
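One way to get used to the indices is to spell them out in words. The helper below, `describe_weight`, is a hypothetical name introduced only for this sketch; it follows the convention used in this notebook (upper index is the layer where the connection ends, lower indices are the start node and the end node).

```python
def describe_weight(layer, start, end):
    """Spell out a weight's indices using the convention in the text."""
    return (
        f"w^{layer}_{start}{end}: node {start} of layer {layer - 1} "
        f"-> node {end} of layer {layer}"
    )


# The four weights between layer 0 (two nodes) and layer 1 (two nodes).
for start in (1, 2):
    for end in (1, 2):
        print(describe_weight(1, start, end))
```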

Why did we define this complicated-looking expression? We need to write down a weighted sum for each of the nodes in the hidden layer, and for that purpose, we need to identify which weights are part of the weighted sum. For example, the weighted sum expression for the first node in the hidden layer becomes:


$$ \text{weighted sum for node 1} = \text{w}^{1}_{11} \times  \text{feature 1} + \text{w}^{1}_{21} \times  \text{feature 2}    $$

In a similar vein to the weight, rather than writing down "weighted sum for node 1 in layer 1" we can write it in the following notation. 

$$  z^1_{1} \rightarrow  \text{weighted sum for node 1 for layer 1 } $$

Since the hidden layer has 2 nodes, there are two weighted sums to write down.

$$ z^1_{1} = \text{w}^{1}_{11} \times  \text{feature 1} + \text{w}^{1}_{21} \times  \text{feature 2}    $$
<br>
$$ z^1_{2} = \text{w}^{1}_{12} \times  \text{feature 1} + \text{w}^{1}_{22} \times  \text{feature 2}    $$
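The two weighted sums can be computed with a short sketch. The weight values and feature values are arbitrary placeholders, and the dictionary keys `(start, end)` mirror the two lower indices of each weight.

```python
# Arbitrary placeholder values for one row of data: feature 1 and feature 2.
features = {1: 1.5, 2: -2.0}

# Layer-1 weights keyed by (start node, end node), mirroring the lower indices.
w1 = {(1, 1): 0.2, (2, 1): -0.4, (1, 2): 0.7, (2, 2): 0.1}


def weighted_sum(node, weights, inputs):
    """Sum weight * input over every connection that ends at the given node."""
    return sum(weights[(start, node)] * value for start, value in inputs.items())


z1_1 = weighted_sum(1, w1, features)  # w^1_11 * feature 1 + w^1_21 * feature 2
z1_2 = weighted_sum(2, w1, features)  # w^1_12 * feature 1 + w^1_22 * feature 2

print(z1_1, z1_2)
```

Note that the second lower index selects the hidden node, so each weighted sum only uses the weights whose connections end at that node.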

 If you ever get confused about the upper and lower indices, then refer to figure 4 for clarification. It takes some practice, but you will settle into the notation.

Once the weighted sums are calculated, we put them into the activation function just like we did for the Perceptron. So we will define a new term that represents the output of this operation. 

$$ a^1_1 = \text{activation function}\big( z^1_{1} \big) $$ 

The term on the left represents the result of applying the activation function to the weighted sum. As with the weighted sum, the upper index represents the layer number and the lower index represents the node number in that layer. Figure 5 highlights the similarities between the two notations. 

<p>
    <img src="z_a_notation.jpg" width=800 alt>
    <b>Figure 5</b>
</p> 
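The shared indexing of $z$ and $a$ can be sketched in code as follows. The sigmoid is used here only as a stand-in activation function (activation functions are covered in the next notebook), and the weighted sum values are placeholders.

```python
import math


def sigmoid(z):
    """Stand-in activation function; maps any weighted sum to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# z[layer][node] holds the weighted sums; the values are placeholders.
z = {1: {1: 1.1, 2: 0.85}}

# a[layer][node] uses exactly the same indices as z[layer][node].
a = {
    layer: {node: sigmoid(value) for node, value in nodes.items()}
    for layer, nodes in z.items()
}

print(a[1][1])  # activation output of node 1 in layer 1
print(a[1][2])  # activation output of node 2 in layer 1
```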

 Here is an example of what the weighted sum and the activation function output look like, using the dataset from the Perceptron notebook. 

<p>
    <img src="all_weights_mlp.jpg" width=1000 alt>
    <b>Figure 6</b>
</p> 

Figure 6 shows how you would calculate the weighted sum and the activation function output for a row of data from a dataset. The sample dataset is what you see in (A). You can find this example in the notebook on the Perceptron. The purple rectangle represents a single row of data. The row of data is fed to the input nodes of the network. Since the dataset has four features, there are four input nodes. In each of (B), (C), and (D), the input nodes pass the row of data to one hidden layer node, which calculates a weighted sum. An activation function then maps the weighted sums to the activation outputs $a^1_1, a^1_2, a^1_3 $. There are three activation outputs since there are three hidden layer nodes. Keep in mind that the weights that you see in (B), (C), and (D) are arbitrary. In practice, one would draw the initial values from a normal distribution before training starts, and as the neural network trains, the weights change due to gradient descent and backpropagation. 
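The calculation for a four-feature row and three hidden nodes can be sketched as below. The row values are made up rather than taken from the dataset in (A), the weights are drawn from a standard normal distribution with a fixed seed, and the sigmoid is again only a stand-in activation function.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

n_features, n_hidden = 4, 3
row = [1.0, 0.5, -1.2, 2.0]  # hypothetical row with four feature values

# weights[j][i] connects input node i to hidden node j (0-indexed here),
# with initial values drawn from a standard normal distribution.
weights = [
    [random.gauss(0.0, 1.0) for _ in range(n_features)]
    for _ in range(n_hidden)
]


def sigmoid(z):
    """Stand-in activation function."""
    return 1.0 / (1.0 + math.exp(-z))


# One weighted sum and one activation output per hidden node.
z = [sum(w * x for w, x in zip(node_weights, row)) for node_weights in weights]
a = [sigmoid(value) for value in z]

print(z)  # three weighted sums
print(a)  # three activation outputs, one per hidden node
```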

Another way to think about figure 6 is that (B), (C), and (D) are similar to individual Perceptrons. These individual Perceptrons pick up different patterns in the data and pass them on to the output layer. 

Going back to figure 4, here is how the weights from the hidden layer are connected to the output layer. 


<p>
    <img src="hidden_output.jpg" width=800 alt>
    <b>Figure 7</b>
</p> 

The activation output at the output layer is calculated from a weighted sum of the activation outputs from the hidden layer. This is a direct extension of what we did between the input layer and the hidden layer. The difference is that between the input layer and the hidden layer, we calculated the weighted sum using the input values. In contrast, between the hidden layer and the output layer, the weighted sum uses the activation outputs. 
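A minimal sketch of this step, assuming two hidden nodes, one output node, placeholder values, and the sigmoid as a stand-in activation function:

```python
import math


def sigmoid(z):
    """Stand-in activation function."""
    return 1.0 / (1.0 + math.exp(-z))


# Placeholder activation outputs of the hidden layer: a^1_1 and a^1_2.
hidden_activations = [0.75, 0.70]

# Arbitrary weights from the hidden layer to the output node: w^2_11 and w^2_21.
output_weights = [0.5, -0.3]

# The weighted sum now runs over activation outputs instead of input features.
z2_1 = sum(w * a for w, a in zip(output_weights, hidden_activations))
a2_1 = sigmoid(z2_1)

print(z2_1, a2_1)
```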

Adding another hidden layer increases the number of parameters that the neural network needs to learn. Suppose we add another two-node hidden layer to figure 4. Then we will have the following activation outputs, each built from a weighted sum:

<p>
    <img src="two_hidden_layers.jpg" width=800 alt>
    <b>Figure 8</b>
</p> 

For the first hidden layer - 

$$ a^1_{1} = \text{Activation function}\big( \text{w}^{1}_{11} \times  \text{feature 1} + \text{w}^{1}_{21} \times  \text{feature 2} \big) $$

$$ a^1_{2} = \text{Activation function}\big( \text{w}^{1}_{12} \times  \text{feature 1} + \text{w}^{1}_{22} \times  \text{feature 2} \big) $$

For second hidden layer - 

$$ a^2_{1} = \text{Activation function}\big( \text{w}^{2}_{11} \times  \text{a}^1_1 + \text{w}^{2}_{21} \times \text{a}^1_2 \big) $$


$$ a^2_{2} = \text{Activation function}\big( \text{w}^{2}_{12} \times  \text{a}^1_1 + \text{w}^{2}_{22} \times \text{a}^1_2 \big) $$


For the output layer - 

$$ a^3_{1} = \text{Activation function}\big( \text{w}^{3}_{11} \times  \text{a}^2_1 + \text{w}^{3}_{21} \times \text{a}^2_2 \big) $$

I omitted writing the weighted sum as z to save space, but the terms inside the brackets are the weighted sums. 
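The five equations above can be chained into a single forward pass. This is a sketch under stated assumptions: the weight values are arbitrary, the sigmoid stands in for the activation function, bias terms are left out just as in the equations, and `forward` is a hypothetical helper name.

```python
import math


def sigmoid(z):
    """Stand-in activation function."""
    return 1.0 / (1.0 + math.exp(-z))


# w[layer][(start, end)] mirrors the upper and lower indices; values are arbitrary.
w = {
    1: {(1, 1): 0.2, (2, 1): -0.4, (1, 2): 0.7, (2, 2): 0.1},
    2: {(1, 1): 0.5, (2, 1): -0.6, (1, 2): 0.3, (2, 2): 0.8},
    3: {(1, 1): 1.0, (2, 1): -1.0},
}


def forward(features, w):
    """Return the activation outputs of every layer, keyed by layer then node."""
    # Layer 0 simply holds the input features, numbered from 1.
    a = {0: {i + 1: x for i, x in enumerate(features)}}
    for layer in sorted(w):
        end_nodes = sorted({end for (_, end) in w[layer]})
        a[layer] = {
            end: sigmoid(
                sum(w[layer][(start, end)] * a[layer - 1][start]
                    for start in a[layer - 1])
            )
            for end in end_nodes
        }
    return a


activations = forward([1.5, -2.0], w)
print(activations[1])     # a^1_1 and a^1_2
print(activations[2])     # a^2_1 and a^2_2
print(activations[3][1])  # a^3_1, the output of the network
```

Each layer only needs the activation outputs of the layer before it, which is why the same loop works no matter how many hidden layers are added.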

The complexity of a model increases as more hidden layers are added, since the model has access to more weight parameters, which means that the neural network can pick up various levels of patterns in the data. The "deep" part of deep learning refers to stacking many hidden layers, which creates many more parameters in the neural network. As you will see later on in the course, there will be networks with thousands, millions, and even billions of weight parameters. Regardless of the structure of the network, the goal is always the same: find the optimal weights to make accurate predictions on the dataset. If the network has the optimal weights for a dataset, then it has learned to identify the patterns in the dataset.
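To see how quickly the number of weights grows, the sketch below counts the connection weights of a fully connected network from its layer sizes. It deliberately ignores bias terms, `count_weights` is a hypothetical helper name, and the last set of layer sizes is a made-up example for 28 x 28 pixel images such as MNIST.

```python
def count_weights(layer_sizes):
    """Count connection weights between consecutive fully connected layers.

    Bias terms are not included in this count.
    """
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


print(count_weights([2, 2, 1]))           # 2*2 + 2*1 = 6
print(count_weights([2, 2, 2, 1]))        # 2*2 + 2*2 + 2*1 = 10
print(count_weights([784, 128, 64, 10]))  # 784*128 + 128*64 + 64*10 = 109184
```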


Understanding how the weights are set up is part 1 of the story. In the next notebook, we shall look in detail at activation functions and loss functions. In talking about the Perceptron algorithm, we paid little attention to them, but we cannot proceed without a clear understanding of what they do. Regardless of which deep learning library you use (TensorFlow, PyTorch, Caffe, etc.), you will need to define these elements to train a network. 




# References
 [1] Kurt Hornik (1991) "Approximation Capabilities of Multilayer Feedforward Networks", Neural Networks, 4(2), 251–257. doi:10.1016/0893-6080(91)90009-T
 
 [2] Much of the notation has been adopted from http://neuralnetworksanddeeplearning.com/chap2.html