# Neural Networks 1 - Introduction and Forward Propagation

## 1. Introduction and Intuition

Most machine learning tutorials open with the following image and describe it as a mystical black box: you give it some input, and it produces predictions as output.
<img src="https://ml-cheatsheet.readthedocs.io/en/latest/_images/neural_network_w_matrices.png" width=75%>
So let's demystify this black box by breaking the image into pieces and understanding what each component means. By the end, you should have a good grasp of how a neural network works. Let's start with the most fundamental component of this black box and build our way up.

### 1.1 Neurons

A neuron can be thought of as a single unit of logistic regression. It takes in an array of inputs, computes its dot product with the neuron's weight vector, adds the bias, and finally applies an activation function to return an output.

<img src="https://ml-cheatsheet.readthedocs.io/en/latest/_images/neuron.png" width=50%>

The inputs to each neuron can be the data you feed to the network or the outputs of other neurons.
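
The computation of a single neuron can be sketched in a few lines of NumPy. This is only a minimal illustration that assumes a sigmoid activation (as in logistic regression); the function names `sigmoid` and `neuron_forward` and the numbers used are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_forward(x, w, b):
    """Output of one neuron: activation(w . x + b)."""
    z = np.dot(w, x) + b  # weighted sum of the inputs plus the bias
    return sigmoid(z)

# Hypothetical values, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0])    # three input features
w = np.array([0.5, -0.25, 0.0])  # one weight per input feature
b = 0.0                          # scalar bias

# z = 0.5*1 - 0.25*2 + 0*3 + 0 = 0, so the output is sigmoid(0) = 0.5
a = neuron_forward(x, w, b)
```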

### 1.2 Layers

These neurons, "stacked in a column", make a layer of the neural network. Stacked in a column means that the weight vectors and biases of the neurons are stacked one on top of the other. This is what gives us the weight matrix of that layer, denoted by $W$.

Say the size of the input vector to the layer, and hence to EACH NEURON, is $n_0$, and the number of neurons in this layer is $n_1$.
1. The shape of $w$, the weight vector of a single neuron, is $(1, n_0)$.

If these $n_1$ row vectors are stacked on top of each other, we get the matrix $W$. This matrix has $n_1$ rows and $n_0$ columns. Hence,
2. The shape of $W$, the weight matrix of the layer, is $(n_1, n_0)$.

Similarly, whereas the bias of the logistic regression classifier was a scalar, the bias of the layer is a column vector ($n_1$ biases stacked on top of each other) of shape $(n_1, 1)$. Note that the bias of the layer will still be denoted by $b$. The reason for this will be clear soon.

The propagation equation of a layer is an intuitive extension of the one we looked at for logistic regression.

Expressing the input of the layer by $a_0$ containing $n_0$ features, and output of the layer by $a_1$:

$$ a_1 = W \cdot a_0 + b $$

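From the rules of matrix multiplication, it is clear that $a_1$ has shape $(n_1, 1)$. (The activation function is left out of this equation for simplicity; it is applied element-wise, so it does not change the shape.)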
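
The shapes above can be checked with a small NumPy sketch. The layer sizes and the random values here are hypothetical, chosen only to make the shapes visible.

```python
import numpy as np

n_0, n_1 = 4, 3  # hypothetical sizes: 4 input features, 3 neurons

rng = np.random.default_rng(0)

# One weight row vector of shape (1, n_0) per neuron.
neuron_weights = [rng.standard_normal((1, n_0)) for _ in range(n_1)]

W = np.vstack(neuron_weights)        # rows stacked on top of each other: (n_1, n_0)
b = np.zeros((n_1, 1))               # one bias per neuron: (n_1, 1)
a_0 = rng.standard_normal((n_0, 1))  # input column vector: (n_0, 1)

a_1 = W @ a_0 + b                    # layer output: (n_1, 1)

print(W.shape, b.shape, a_1.shape)   # (3, 4) (3, 1) (3, 1)
```

Row $i$ of `a_1` is exactly what neuron $i$ would compute on its own from its row vector and its bias, which is why stacking the neurons into one matrix loses nothing.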

> Note: From now, each neuron will also be referred to as a unit of a layer. For example, a layer might have 3 units, meaning it has 3 neurons.

> You might even see people refer to the input vector as a layer as well. This leads to a more fundamental discussion of whether it is the weights or the features that make a layer. If that does not make sense, don't worry. The concepts will be explained in detail; just remember that there is a reason why the input vector to a Neural Network is called an "Input Layer".

### 1.3 The Neural Network as a Classifier

A Neural Network is an ordered arrangement of several of these layers, each with its own number of units, activation function, etc. The number of layers in a Neural Network and the number of units in almost all layers are decided by the programmer.

> Note: All the various numbers that the programmer can tweak to make a model better are called **hyperparameters** (as opposed to the parameters, i.e. the weights and the biases). Some examples of hyperparameters that you have come across so far are the learning rate $\alpha$, the number of layers in a network, and the number of units in a layer.

The job of a Neural Network is to compute the approximate probabilities of a given input belonging to each class. If there are 5 classes that it can belong to (e.g. "cat", "dog", "bat", "ant", and "others"), then the final output of the network might look something like this:

Index | Label | Probability
---|---|---
0 | "cat" | 0.12
1 | "dog" | 0.04
2 | "bat" | 0.7
3 | "ant" | 0.07
4 | "others" | 0.07

From this we can infer that, $P(\text{image is of a cat} | \text{image} = x) = 0.12 $

> Going with the logistic regression analogy, the sigmoid activation function won't suffice, since it returns just a single scalar whereas we need a vector of probabilities. For a multiclass classifier we will use the natural extension of the sigmoid function: the softmax function. There are several other activation functions as well, all of which we will look at soon.

The true labels are represented in the same indexed form using one-hot encoding: the label is converted into a column vector in which the entry for the class the input belongs to is 1 and the rest are 0, indicating that the probability of the input belonging to that class is 1 and to every other class is 0. The one-hot encoding for this input might look like $[[0.], [0.], [1.], [0.], [0.]]$, indicating that the input actually belongs to the class given by index 2, or "bat".
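
As a rough sketch of how such a probability vector and its one-hot label could be produced, the snippet below uses the standard definition of softmax, $\text{softmax}(z)_i = e^{z_i} / \sum_k e^{z_k}$. The raw scores in `z` and the helper names `softmax` and `one_hot` are hypothetical and only serve the illustration.

```python
import numpy as np

def softmax(z):
    """Turn a column vector of scores into probabilities that sum to 1."""
    shifted = z - np.max(z)  # subtracting the max avoids overflow in exp
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

def one_hot(index, num_classes):
    """Column vector with 1 at `index` and 0 everywhere else."""
    y = np.zeros((num_classes, 1))
    y[index, 0] = 1.0
    return y

labels = ["cat", "dog", "bat", "ant", "others"]

# Hypothetical raw scores from an output layer, one per class.
z = np.array([[1.0], [0.0], [3.0], [0.5], [0.5]])
probs = softmax(z)

# The predicted class is the one with the highest probability.
predicted = labels[int(np.argmax(probs))]

# One-hot encoding of the true label "bat" (index 2).
y = one_hot(labels.index("bat"), len(labels))
```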


The trivial neural network would be a binary logistic regression classifier, with just a single layer (the output layer) and two possible classes.

Taking one more step, let's add one more layer between the input layer and the output layer.

The input vector $x$ will be fed into this new layer, H. Let's call its output $a_h$. The output of this layer can itself be considered a feature vector, and will be used as the input to the next layer.

> Now, since this $a_h$ is "hidden" from the user, layer H is called a hidden layer.

This $a_h$ will be fed into the output layer, which gives you the final output: the probabilities of $x$ belonging to the different classes.

This finally leads us to the picture we saw earlier:
<img src="https://ml-cheatsheet.readthedocs.io/en/latest/_images/neural_network_w_matrices.png">

Now, the Neural Network described above has a single hidden layer. However, most networks will have several hidden layers of various sizes.

By convention, the number of layers in a Neural Network is the number of hidden layers plus the output layer, and is denoted by $L$. The input layer is, by extension, the $0^{th}$ layer.

Hence, the Neural Network we described is a 2 Layer Neural Network, or a Neural Network with 1 Hidden Layer.

> The number of layers, the number of units per layer, etc describe the **Architecture of the Neural Network**. Other items which we have not yet covered that also describe the architecture include Activation Function of each layer, or the type of layer among several others.

## 2. Forward Propagation in a 2 Layer Network


Now that we know what a Neural Network does, let us look at how it does it. 

Assume that we have a 2 Layer model that takes an $n_x$-dimensional feature vector, and that there are 2 classes, i.e. binary classification.

If the number of units in each layer is given by $n^{[l]}$, where $l$ denotes the layer number, then

1. The Input Layer has $n^{[0]} = n_x$ units
2. The Hidden Layer has $n^{[1]}$ units
3. The Output Layer has $n^{[2]} = 1$ unit

Let the input vector be $x$ and have the shape $(n_x, 1)$ or $(n^{[0]}, 1)$. 

The input vector is then multiplied by the weight matrix of the hidden layer and the bias is added, giving the linear output of the hidden layer. We then apply the activation function, sigmoid. In equation form,

$$ z^{[1]} = W^{[1]} \cdot x + b^{[1]} $$
$$ a^{[1]} = S(z^{[1]}) $$

where the superscript $[1]$ denotes the first layer.

> Since the hidden layer, or layer 1, has $n^{[1]}$ units, $W^{[1]}$ must have $n^{[1]}$ rows. And for the matrix product with $x$ to make sense, it must have $n^{[0]}$ columns. Therefore, $W^{[1]}$ has the shape $(n^{[1]}, n^{[0]})$.
> $b^{[1]}$ has the shape $(n^{[1]}, 1)$.


Now, we use the $a^{[1]}$ as input to the 2nd Layer, the Output Layer.
Using a similar algorithm, we get:

$$ z^{[2]} = W^{[2]} \cdot a^{[1]} + b^{[2]} $$
$$ \hat y = a^{[2]} = S(z^{[2]}) $$

> As we did for the Hidden Layer, we can derive the shape of $W^{[2]}$ as $(n^{[2]}, n^{[1]})$ and $b^{[2]}$ has the shape $(n^{[2]}, 1)$


These steps complete the process of forward propagation in the network. For this week, we will be looking at 2 Layer networks only. We will generalise these results to an $L$ Layer network in a subsequent week.
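
The four equations above translate almost line by line into NumPy. The following is a minimal sketch rather than a reference implementation: the function name `forward_propagation`, the layer sizes, and the small random parameter values are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid S(z), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(x, W1, b1, W2, b2):
    """Forward pass of a 2 layer network with sigmoid activations."""
    z1 = W1 @ x + b1   # linear output of the hidden layer, shape (n1, 1)
    a1 = sigmoid(z1)   # activation of the hidden layer,    shape (n1, 1)
    z2 = W2 @ a1 + b2  # linear output of the output layer, shape (n2, 1)
    a2 = sigmoid(z2)   # y_hat, the predicted probability,  shape (n2, 1)
    return a2, a1

# Hypothetical layer sizes: 3 input features, 4 hidden units, 1 output unit.
n_x, n_1, n_2 = 3, 4, 1

rng = np.random.default_rng(1)
W1 = rng.standard_normal((n_1, n_x)) * 0.01  # shape (n1, n0)
b1 = np.zeros((n_1, 1))                      # shape (n1, 1)
W2 = rng.standard_normal((n_2, n_1)) * 0.01  # shape (n2, n1)
b2 = np.zeros((n_2, 1))                      # shape (n2, 1)

x = rng.standard_normal((n_x, 1))            # a single input column vector
y_hat, a1 = forward_propagation(x, W1, b1, W2, b2)
```

Here `y_hat` is a $(1, 1)$ array holding a value between 0 and 1, which can be read as the predicted probability of the positive class.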


### 2.1 Initialization of the Parameters

In the Logistic Regression classifier, we initialized the weights with a zero vector and the bias with zero, and optimized their values over time. However, the same approach cannot be applied to a Neural Network.

tl;dr: Zero or constant initialization fails to break the symmetry of the neurons.

Let us assume that we have a 2 Layer network in which the weights and biases of all layers are initialized to a constant number, say $c$.

<img src="https://i.stack.imgur.com/agyRr.png">

Now, notice that each unit in the hidden layer will get the exact same signal. (Each neuron does its computation on the entire input vector.)

i.e., each hidden unit $i$ will do the following computation:

$$ z^{[1]}_i = \sum_{j=1}^{n^{[0]}} W^{[1]}_{ij} \, x_j + b^{[1]}_{i} $$
$$ a^{[1]}_i = S(z^{[1]}_i) $$

$W^{[1]}_{ij}$ denotes the $j^{th}$ element in the weight vector of the $i^{th}$ neuron in the 1st layer (i.e. the hidden layer).

Since $W^{[1]}_{ij} = c$ and $b^{[1]}_i = c$, $\forall i \in [1, n^{[1]}], \forall j \in [1, n^{[0]}]$, each unit is doing the EXACT SAME CALCULATION. E.g., if $c = 1$, then each neuron is just calculating the sum of the entries of the input vector (plus the bias). And so all the neurons output the same signal.

This reduces your entire Neural Network model, with several thousand connections or "synapses", to a glorified Logistic Regression model.
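
A quick numerical check makes the symmetry visible. In this sketch the sizes, the constant $c = 1$, and the input values are hypothetical; only the forward pass of the hidden layer is shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_0, n_1 = 3, 4  # hypothetical sizes: 3 input features, 4 hidden units
c = 1.0          # the constant used for every weight and bias

x = np.array([[0.2], [-1.0], [0.5]])  # an arbitrary input vector

W1 = np.full((n_1, n_0), c)  # every weight equals c
b1 = np.full((n_1, 1), c)    # every bias equals c

# Each row of W1 is identical, so each unit computes sum(x) + c = 0.7.
a1 = sigmoid(W1 @ x + b1)

# True: all hidden units output exactly the same value.
all_same = bool(np.allclose(a1, a1[0, 0]))
```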

> For an amazing visualization of this exact phenomenon, visit [Initializing Neural Networks by deeplearning.ai](https://www.deeplearning.ai/ai-notes/initialization/)

So, in essence, we will always use random initialization for the weights. Biases can still be initialized to zero: since $W$ has been initialized randomly, the symmetry is already broken.

> Other more sophisticated methods of initialization exist as well, e.g. Xavier Initialization. Feel free to explore them on your own.
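
One possible way to write this initialization for a 2 Layer network is sketched below. The function name `initialize_parameters`, the dictionary keys, and the scaling factor of `0.01` are assumptions for this example (small random weights are a common choice, not something prescribed here).

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y, scale=0.01, seed=None):
    """Random weights and zero biases for a 2 layer network.

    n_x, n_h, n_y: sizes of the input, hidden and output layers.
    scale: assumed factor that keeps the initial weights small.
    """
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((n_h, n_x)) * scale,  # (n1, n0), random
        "b1": np.zeros((n_h, 1)),                       # (n1, 1), zeros
        "W2": rng.standard_normal((n_y, n_h)) * scale,  # (n2, n1), random
        "b2": np.zeros((n_y, 1)),                       # (n2, 1), zeros
    }

params = initialize_parameters(n_x=3, n_h=4, n_y=1, seed=0)
```

Because the rows of `params["W1"]` differ from each other, each hidden unit now computes a different function of the input, even though all the biases start at zero.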

### 2.2 Activation Functions 

Activation functions live inside neural network layers and modify the data they receive before passing it to the next layer. They give neural networks their power: by modifying inputs with non-linear functions, neural networks can model highly complex, non-linear relationships between features.

So far, the only activation function you have used is the sigmoid function. However, there are several other functions which often give much better accuracy when used in the hidden layer(s).

Activation functions typically have the following properties:

1. Non-Linear: In linear regression we’re limited to a prediction equation that looks like a straight line. This is nice for simple datasets with a linear relationship between inputs and outputs, but what if the patterns in our dataset were non-linear (e.g. $x^2$, $\sin$, $\log$)? To model these relationships we need a non-linear prediction equation. Activation functions provide this non-linearity.

2. Continuously differentiable: To improve our model with gradient descent, we need our output to have a nice slope so we can compute error derivatives with respect to weights. 

3. Fixed Range: Activation functions typically squash the input data into a narrow range that makes training the model more stable and efficient.
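
To see why the non-linearity matters, consider what happens if the activation is dropped entirely: two stacked layers then collapse into a single linear layer, since $W^{[2]}(W^{[1]}x + b^{[1]}) + b^{[2]} = (W^{[2]}W^{[1]})x + (W^{[2]}b^{[1]} + b^{[2]})$. The sketch below checks this numerically; the sizes and random values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_0, n_1, n_2 = 3, 4, 2  # hypothetical layer sizes

W1 = rng.standard_normal((n_1, n_0))
b1 = rng.standard_normal((n_1, 1))
W2 = rng.standard_normal((n_2, n_1))
b2 = rng.standard_normal((n_2, 1))
x = rng.standard_normal((n_0, 1))

# Two layers with NO activation function in between.
two_layers = W2 @ (W1 @ x + b1) + b2

# A single linear layer with combined weights and bias.
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

# True: without a non-linearity the extra layer adds no expressive power.
collapses = bool(np.allclose(two_layers, one_layer))
```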

#### 2.2.1 Rectified Linear Unit (ReLU)

The Rectified Linear Unit is a very simple activation function. For all positive real numbers it returns the number itself, and for everything else it returns 0.

In Mathematical Form,

Function | Derivative
---|---
$R(z) = \begin{cases} z, & \text{if } z \gt 0 \\ 0, & \text{if } z \le 0 \end{cases}$ | $R'(z) = \begin{cases} 1, & \text{if } z \gt 0 \\ 0, & \text{if } z \lt 0 \end{cases}$
<img src="https://ml-cheatsheet.readthedocs.io/en/latest/_images/relu.png" width=250px> | <img src="https://ml-cheatsheet.readthedocs.io/en/latest/_images/relu_prime.png" width=250px>


1. ReLU has been shown to perform very well for Deep Neural Networks.
2. It is computationally much cheaper than Sigmoid or the other activation functions we'll soon see.
3. In a classifier, ReLU is used only in the hidden layers of a neural network; the output layer needs an activation such as sigmoid or softmax that returns probabilities.
4. A drawback of ReLU is that it has a range of $[0, \infty)$. This means that it can blow up the activation of a layer.
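
A minimal NumPy sketch of ReLU and its derivative is given below. The function names are hypothetical, and returning 0 for the derivative at $z = 0$ (where it is mathematically undefined) is an assumption made here for convenience.

```python
import numpy as np

def relu(z):
    """Element-wise ReLU: z where z > 0, otherwise 0."""
    return np.maximum(0.0, z)

def relu_derivative(z):
    """Element-wise derivative of ReLU: 1 where z > 0, otherwise 0.

    The derivative is undefined at z == 0; this sketch returns 0 there.
    """
    return (z > 0).astype(float)

z = np.array([[-2.0], [0.0], [3.0]])

print(relu(z).ravel())             # [0. 0. 3.]
print(relu_derivative(z).ravel())  # [0. 0. 1.]
```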
