# Shallow Neural Network


### Table of Contents

* [1. Neural Network Representation](#chapter1)
    * [1.1 Representation](#section_1_1)
    * [1.2 Computing a Neural Network's output](#section_1_2)
    * [1.3 Vectorizing across multiple examples](#section_1_3)
* [2. Activation functions](#chapter2)
    * [2.1 Overview of the activation functions](#section_2_1)
    * [2.2 Why do you need non-linear activation functions](#section_2_2)
    * [2.3 Derivatives of activation functions](#section_2_3)
* [3. Gradient Descent for one hidden layer Neural network](#chapter3)
* [4. Random initialization](#chapter4)
* [5. Recap](#chapter5)

# 1. Neural Network Representation <a class="anchor" id="chapter1"></a>

## 1.1 Representation <a class="anchor" id="section_1_1"></a>

<center><img src="images/03-shallow neural network/NN-representation.PNG" width = "400px"></center>

- The input features x1, x2, x3 are stacked vertically. This is the <b>input layer</b> of the neural network.
- Next comes another layer of circles, called the <b>hidden layer</b> of the neural network.
- The final layer has, in this case, just one node. This single-node layer is the <b>output layer</b>, and it generates the predicted value.

> This neural network has <b>two layers</b>, because by convention we don't count the input layer.

## 1.2 Computing a Neural Network's output <a class="anchor" id="section_1_2"></a>

A node in a layer does two steps of computation:

- 1: It computes z = w.T x + b
- 2: It computes the activation function 

<center><img src="images/03-shallow neural network/single-node.PNG" width = "300px"></center>

Every node in the network, including each node of the hidden layer, performs these two steps.
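As a minimal sketch of these two steps for a single node, assuming a sigmoid activation and made-up values for `x`, `w` and `b` (none of these numbers come from the notes):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values for one node with three inputs.
x = np.array([[1.0], [2.0], [3.0]])    # input column vector, shape (3, 1)
w = np.array([[0.1], [-0.2], [0.3]])   # weights of the node, shape (3, 1)
b = 0.5                                # bias of the node (scalar)

z = w.T @ x + b    # step 1: linear part z = w.T x + b, shape (1, 1)
a = sigmoid(z)     # step 2: activation, shape (1, 1)

print("z =", z.item(), "a =", a.item())
```

Here z works out to 0.1 - 0.4 + 0.9 + 0.5 = 1.1, and the activation squashes it into the interval (0, 1).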

By convention, the notation is the following:

$$ a^{[l]}_i $$
- a represents the activation of the layer (input layer a[0] | hidden layer a[1] | output layer a[2])
- l means the layer l
- i means the node i

> Let's compute the activation of the first node in the hidden layer

<center><img src="images/03-shallow neural network/hidden-layer-node1.PNG" width = "300px"></center>

The representation of X and W are:
$$X =\begin{bmatrix} x_1 \\ x_2 \\ x_3  \end{bmatrix} $$
$$W^{[1]}_1 =\begin{bmatrix} w_{11} \\ w_{12} \\ w_{13}  \end{bmatrix} $$

We have two steps to compute in this node:

$$ Z^{[1]}_1 = W^{[1]T}_1 X + b^{[1]}_1$$
$$ a^{[1]}_1 = \sigma(Z^{[1]}_1) $$ 

> We repeat this method on each node on the hidden layer

<center><img src="images/03-shallow neural network/hidden-layer-node2.PNG" width = "300px"></center>

$$ 
    \begin{cases}
    Z^{[1]}_1 = W^{[1]T}_1 X + b^{[1]}_1 \\
    Z^{[1]}_2 = W^{[1]T}_2 X + b^{[1]}_2 \\
    Z^{[1]}_3 = W^{[1]T}_3 X + b^{[1]}_3 \\
    Z^{[1]}_4 = W^{[1]T}_4 X + b^{[1]}_4 
    \end{cases}
$$

$$ 
    \begin{cases}
    a^{[1]}_1 = \sigma(Z^{[1]}_1) \\
    a^{[1]}_2 = \sigma(Z^{[1]}_2) \\
    a^{[1]}_3 = \sigma(Z^{[1]}_3) \\
    a^{[1]}_4 = \sigma(Z^{[1]}_4) 
    \end{cases}
$$

When vectorizing, a useful rule of thumb is that the different nodes of a layer are stacked vertically.

$$Z^{[1]} = W^{[1]} X + b^{[1]} = \begin{bmatrix} ---w^{[1]T}_1---\\ ---w^{[1]T}_2--- \\ ---w^{[1]T}_3--- \\ ---w^{[1]T}_4---\end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3  \end{bmatrix} 
+ \begin{bmatrix} b^{[1]}_1\\ b^{[1]}_2 \\ b^{[1]}_3 \\ b^{[1]}_4 \end{bmatrix} 
= \begin{bmatrix} Z^{[1]}_1\\ Z^{[1]}_2 \\ Z^{[1]}_3 \\ Z^{[1]}_4\end{bmatrix}$$

$$ a^{[1]} = \sigma(Z^{[1]})= \begin{bmatrix} a^{[1]}_1\\ a^{[1]}_2 \\ a^{[1]}_3 \\ a^{[1]}_4\end{bmatrix}$$

We have completed the computation of the hidden layer. Now we need to carry out the same process on the output layer.


> Output layer

<center><img src="images/03-shallow neural network/output-node.PNG" width = "300px"></center>

There is only one node in the output layer, so the notation will be:
$$W^{[2]} =\begin{bmatrix} w_{21} \\ w_{22} \\ w_{23} \\ w_{24}  \end{bmatrix}^T = \begin{bmatrix} w_{21} & w_{22} & w_{23} & w_{24}  \end{bmatrix} $$
$$ Z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}$$ 

$$ a^{[2]} = \sigma(Z^{[2]}) $$ 


Neural Network with one hidden layer -  Equations for one example:
$$Z^{[1]} = W^{[1]} X + b^{[1]} $$

$$ a^{[1]} = \sigma(Z^{[1]})$$

$$ Z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}$$ 

$$ a^{[2]} = \sigma(Z^{[2]}) $$ 
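The four equations above can be sketched in NumPy for a single example. The sizes match the figures (3 inputs, 4 hidden units, 1 output); the parameter values are random placeholders chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

n_x, n_h = 3, 4                                # 3 input features, 4 hidden units
x = rng.standard_normal((n_x, 1))              # one example as a column vector
W1 = rng.standard_normal((n_h, n_x)) * 0.01    # each row is one w^{[1]T}_i
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((1, n_h)) * 0.01      # single output node
b2 = np.zeros((1, 1))

z1 = W1 @ x + b1     # shape (4, 1)
a1 = sigmoid(z1)     # shape (4, 1)
z2 = W2 @ a1 + b2    # shape (1, 1)
a2 = sigmoid(z2)     # shape (1, 1), the predicted value

print(z1.shape, a1.shape, z2.shape, a2.shape)  # (4, 1) (4, 1) (1, 1) (1, 1)
```

Row `i` of `W1 @ x` is exactly the per-node product `w_i.T @ x`, which is why stacking the node weights as rows reproduces the node-by-node equations.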

## 1.3 Vectorizing across multiple examples <a class="anchor" id="section_1_3"></a>

Now we want to vectorize the previous equations of our two-layer neural network across all training examples.

> We consider an input X with m examples. Each example has n features:

$$X=\begin{bmatrix} .. & .. & .. & ..\\ .. & .. & .. & .. \\ X^{(1)} & X^{(2)} & .. & X^{(m)}  \\.. & .. & .. & ..\\ .. & .. & .. & .. \end{bmatrix} \in (n \times m)$$

> So if we consider a hidden layer with k nodes:

 $$ W^{[1]} = \begin{bmatrix} ---w^{[1]T}_1---\\ ---w^{[1]T}_2--- \\ ... \\ ---w^{[1]T}_{k-1}--- \\ ---w^{[1]T}_{k}---\end{bmatrix} \in (k \times n) $$ 

 $$ b^{[1]} = \begin{bmatrix} b^{[1]}_1\\ b^{[1]}_2\\ ... \\ b^{[1]}_{k-1} \\ b^{[1]}_{k}\end{bmatrix}  \in (k \times 1)$$

The result for the hidden layer will be: 

$$Z^{[1]} = W^{[1]} X + b^{[1]} =\begin{bmatrix} h_{11} & .. & .. & h_{1m}\\ .. & .. & .. & .. \\ Z^{[1](1)} & Z^{[1](2)} & .. & Z^{[1](m)}   \\.. & .. & .. & ..\\ .. & .. & .. & .. \end{bmatrix} \in (k \times m)$$

- Vertically we have the hidden units
- Horizontally we have the training examples

    - h11 is the value of the first hidden unit (node 1) for the first example
    - h1m is the value of the first hidden unit (node 1) for the m-th (last) example

We apply the activation:

$$a^{[1]} = \sigma(Z^{[1]})=\begin{bmatrix} .. & .. & .. & ..\\ .. & .. & .. & .. \\ a^{[1](1)} & a^{[1](2)} & .. & a^{[1](m)}   \\.. & .. & .. & ..\\ .. & .. & .. & .. \end{bmatrix} \in (k \times m)$$

> Now we do the same process on the output layer

The output layer has one node so:

$$ W^{[2]} = \begin{bmatrix} w^{[2]}_{1} & .. & .. & .. & w^{[2]}_{k} \end{bmatrix} \in (1 \times k) $$ 

$$Z^{[2]} = W^{[2]} a^{[1]} + b^{[2]} =\begin{bmatrix} w^{[2]}_{1} & .. & .. & .. & w^{[2]}_{k} \end{bmatrix} 
\begin{bmatrix} .. & .. & .. & ..\\ .. & .. & .. & .. \\ a^{[1](1)} & a^{[1](2)} & .. & a^{[1](m)}   \\.. & .. & .. & ..\\ .. & .. & .. & .. \end{bmatrix}  + b^{[2]}$$

$$Z^{[2]} =\begin{bmatrix} Z^{[2](1)} & Z^{[2](2)} & .. & .. & Z^{[2](m)} \end{bmatrix} \in (1 \times m)$$

$$ a^{[2]} = \sigma(Z^{[2]}) $$
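A small sketch of the vectorized forward pass, assuming sigmoid activations in both layers as in the equations above. The helper name `forward` and the sizes `n`, `k`, `m` are illustrative choices. The bias `b1` has shape (k, 1) and NumPy broadcasting adds it to each of the m columns.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Forward pass for all columns (examples) of X at once."""
    Z1 = W1 @ X + b1     # (k, m): b1 is broadcast across the m columns
    A1 = sigmoid(Z1)     # (k, m)
    Z2 = W2 @ A1 + b2    # (1, m)
    A2 = sigmoid(Z2)     # (1, m)
    return Z1, A1, Z2, A2

rng = np.random.default_rng(0)
n, k, m = 3, 4, 5                      # features, hidden units, examples
X = rng.standard_normal((n, m))        # one example per column
W1 = rng.standard_normal((k, n)) * 0.01
b1 = np.zeros((k, 1))
W2 = rng.standard_normal((1, k)) * 0.01
b2 = np.zeros((1, 1))

Z1, A1, Z2, A2 = forward(X, W1, b1, W2, b2)
print(Z1.shape, A1.shape, Z2.shape, A2.shape)   # (4, 5) (4, 5) (1, 5) (1, 5)

# Column j of the vectorized result equals the single-example computation.
for j in range(m):
    x_j = X[:, j:j + 1]
    assert np.allclose(forward(x_j, W1, b1, W2, b2)[3], A2[:, j:j + 1])
```

The final loop is only a sanity check that the matrix form and the one-example-at-a-time form agree. In practice only the vectorized call is needed.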

# 2. Activation Functions <a class="anchor" id="chapter2"></a>

## 2.1 Overview of the Activation Functions <a class="anchor" id="section_2_1"></a>

When you build a Neural network, one of the choices you get to make is what activation function to use in the hidden layers and the output units. 

So far, we've just been using the sigmoid activation function, but sometimes other choices can work much better.

> Sigmoid function:

<center><img src="images/03-shallow neural network/sigmoid_function.PNG" width = "400px"></center>

> tanh activation function:

An activation function that almost always works better than the sigmoid function is the hyperbolic tangent function.

<center><img src="images/03-shallow neural network/tanh_function.PNG" width = "300px"></center>

The values of the tanh function lie between -1 and +1. For hidden units, tanh almost always works better than the sigmoid function because the activations coming out of the hidden layer have a mean closer to zero. Just as you might center the training data to give it zero mean, using tanh instead of sigmoid has the effect of centering the activations around 0 rather than around 0.5. This makes learning for the next layer a little easier.

Example:
- For a binary classification, you use the tanh function for the hidden layer. But for the output layer you prefer the sigmoid function, in order to have y_pred values between 0 and 1.

<b>One of the downsides of both the sigmoid function and the tanh function</b> is that if z is either very large or very small (strongly negative), the slope (derivative) of the function becomes very close to zero, and this can slow down gradient descent.

> Relu function: Rectified Linear Unit

<center><img src="images/03-shallow neural network/relu_function.PNG" width = "400px"></center>

The advantage of the <b>reLU function</b> is that for every positive z the slope of the activation function is exactly 1, far from zero. In practice, a neural network using the reLU activation function will often <b>learn much faster</b> than one using the tanh or the sigmoid activation function. For negative z the slope of reLU is 0; the leaky reLU replaces it with a small positive slope.

> Leaky reLU activation function:

<center><img src="images/03-shallow neural network/leaky_relu_function.PNG" width = "400px"></center>
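For reference, the four activation functions of this section can be written in a few lines of NumPy. This is a sketch: the function names are illustrative, and the leaky slope of 0.01 is the value used in the derivative formulas of section 2.3.

```python
import numpy as np

def sigmoid(z):
    """Values in (0, 1); sigmoid(0) = 0.5."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Values in (-1, 1); centered on 0."""
    return np.tanh(z)

def relu(z):
    """max(0, z), element-wise."""
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    """max(slope * z, z), element-wise; keeps a small slope for z < 0."""
    return np.maximum(slope * z, z)

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
for name, g in [("sigmoid", sigmoid), ("tanh", tanh),
                ("relu", relu), ("leaky_relu", leaky_relu)]:
    print(f"{name:>10}: {np.round(g(z), 4)}")
```

Printing the values at z = -10 and z = 10 shows how sigmoid and tanh flatten out at the extremes, while reLU and leaky reLU keep growing for positive z.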

## 2.2 Why do you need Non-linear activation functions ? <a class="anchor" id="section_2_2"></a>

It turns out that if you use a linear activation function, or alternatively no activation function at all, then no matter how many layers your neural network has, all it computes is a linear function of the input. The hidden layers add nothing, so you might as well not have any.
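To see why, take the identity activation g(z) = z in a network with one hidden layer:

$$ W^{[2]}(W^{[1]} X + b^{[1]}) + b^{[2]} = (W^{[2]} W^{[1]}) X + (W^{[2]} b^{[1]} + b^{[2]}) $$

The short check below, with random placeholder parameters, confirms numerically that the two-layer linear network and a single linear layer give the same output. The names `W_eq` and `b_eq` are introduced here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, m = 3, 4, 5
X = rng.standard_normal((n_x, m))
W1 = rng.standard_normal((n_h, n_x))
b1 = rng.standard_normal((n_h, 1))
W2 = rng.standard_normal((1, n_h))
b2 = rng.standard_normal((1, 1))

# Two layers with the identity activation g(z) = z.
A1 = W1 @ X + b1
A2 = W2 @ A1 + b2

# The equivalent single linear layer.
W_eq = W2 @ W1          # shape (1, n_x)
b_eq = W2 @ b1 + b2     # shape (1, 1)
A_single = W_eq @ X + b_eq

print(np.allclose(A2, A_single))   # True
```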

## 2.3 Derivatives of activation functions <a class="anchor" id="section_2_3"></a>

When you implement backpropagation for your neural network, you need to either compute the slope or the derivative of the activation functions.

> Derivative of sigmoid :

$$ g(z) = \frac{1}{1+e^{-z}}$$

$$ \frac{\partial g}{\partial z} = \frac{e^{-z}}{(1+e^{-z})^2} = g(z)(1-g(z))$$

> Derivative of tanh :

$$ g(z) = \tanh(z) = \frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}$$

$$ \frac{\partial g}{\partial z} = 1 - \frac{(e^{z}-e^{-z})^2}{(e^{z}+e^{-z})^2} = 1 - g(z)^2$$

> Derivative of reLU :

$$ g(z) = \max(0,z) $$
$$\frac{\partial g}{\partial z} = \begin{cases}
                                    0 & \text{if } z<0 \\
                                    1 & \text{if } z>0
                                  \end{cases}$$

> Derivative of leaky reLU :

$$ g(z) = \max(0.01z,z) $$
$$\frac{\partial g}{\partial z} = \begin{cases}
                                    0.01 & \text{if } z<0 \\
                                    1 & \text{if } z>0
                                  \end{cases}$$
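One way to gain confidence in these formulas is to compare them with a finite-difference estimate of the slope. The sketch below does this at a few points away from z = 0, where reLU and leaky reLU are not differentiable. The function names and the step `eps` are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)

# Analytic derivatives from the formulas above.
def d_sigmoid(z):
    g = sigmoid(z)
    return g * (1.0 - g)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def d_relu(z):
    return (z > 0).astype(float)

def d_leaky_relu(z, slope=0.01):
    return np.where(z > 0, 1.0, slope)

def numerical_derivative(g, z, eps=1e-6):
    """Central finite difference (g(z + eps) - g(z - eps)) / (2 * eps)."""
    return (g(z + eps) - g(z - eps)) / (2.0 * eps)

z = np.array([-2.0, -0.5, 0.5, 2.0])   # avoid z = 0, where the kink is
pairs = [(sigmoid, d_sigmoid), (np.tanh, d_tanh),
         (relu, d_relu), (leaky_relu, d_leaky_relu)]
for g, dg in pairs:
    print(g.__name__, np.allclose(dg(z), numerical_derivative(g, z), atol=1e-6))
```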

# 3. Gradient Descent for one hidden layer Neural network <a class="anchor" id="chapter3"></a>

<center><img src="images/03-shallow neural network/one-hidden-layer.png" width = "400px"></center>

> Input nx features, m examples:

$$X = \begin{bmatrix}  X^{(1)} &  X^{(2)} & .. & .. &  X^{(m)} \end{bmatrix}  \in  \mathbb{R}^{n_X \times m}$$  

> Parameters and Loss function:

- n[2] = 1 (-> 1 output unit)

$$ W^{[1]} \in (n^{[1]} \times n_X) $$ 
$$ b^{[1]} \in (n^{[1]} \times 1) $$ 
$$ W^{[2]} \in (n^{[2]} \times n^{[1]}) $$ 
$$ b^{[2]} \in (n^{[2]} \times 1) $$ 


> FORWARD PROPAGATION : 

$$
\begin{cases}
    Z^{[1]} = W^{[1]} X + b^{[1]} \\
    A^{[1]} = g^{[1]}(Z^{[1]})  \\
    Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]} \\
    A^{[2]} = g^{[2]}(Z^{[2]})
\end{cases}
$$


> BACKWARD PROPAGATION :

$$
\begin{cases}
    dZ^{[2]} =  (A^{[2]} - Y) \\
    dW^{[2]} = \frac{1}{m} (A^{[2]} - Y)A^{[1]T} \\
    db^{[2]} = \frac{1}{m} \sum (A^{[2]} - Y)    \\
    dZ^{[1]} = W^{[2]T}dZ^{[2]} * g^{[1]'}(Z^{[1]}) \\
    dW^{[1]} = \frac{1}{m} dZ^{[1]} X^T\\
    db^{[1]} = \frac{1}{m} \sum  dZ^{[1]}
\end{cases}
$$
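The forward and backward equations can be put together in a short training loop. This is a sketch under stated assumptions, not a reference implementation: it assumes a tanh hidden layer, a sigmoid output and the binary cross-entropy cost (the setting in which dZ2 = A2 - Y holds), and the data, layer sizes and learning rate are made up. The sums for db are taken over the examples (`axis=1`), and `*` in the dZ1 formula is an element-wise product.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    W1, b1, W2, b2 = params
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)          # assumption: g1 = tanh
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)          # assumption: g2 = sigmoid
    return Z1, A1, Z2, A2

def cost(A2, Y):
    """Binary cross-entropy averaged over the m examples (assumed loss)."""
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m)

def backward(X, Y, params, cache):
    W1, b1, W2, b2 = params
    Z1, A1, Z2, A2 = cache
    m = X.shape[1]
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)   # tanh'(Z1) = 1 - A1^2
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
n_x, n_h, m = 3, 4, 5
X = rng.standard_normal((n_x, m))            # hypothetical inputs
Y = np.array([[1.0, 0.0, 1.0, 1.0, 0.0]])    # hypothetical labels
params = [rng.standard_normal((n_h, n_x)) * 0.01, np.zeros((n_h, 1)),
          rng.standard_normal((1, n_h)) * 0.01, np.zeros((1, 1))]

learning_rate = 0.1
costs = []
for _ in range(100):
    cache = forward(X, params)
    costs.append(cost(cache[3], Y))
    grads = backward(X, Y, params, cache)
    # Gradient descent update: parameter -= learning_rate * gradient.
    params = [p - learning_rate * g for p, g in zip(params, grads)]

print(f"cost: {costs[0]:.4f} -> {costs[-1]:.4f}")
```

Each gradient has the same shape as the parameter it updates, which is a quick check worth doing whenever these formulas are implemented.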

# 4. Random initialization <a class="anchor" id="chapter4"></a>

If you initialize all the weights of the neural network to 0, then every hidden unit in a layer computes exactly the same function. We say the hidden units are completely symmetric: they receive identical gradient updates, so they stay identical no matter how long you train.

> As long as w is initialized randomly, the different hidden units start off computing different things, so you no longer have this symmetry problem.
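The symmetry can be observed directly. The sketch below trains the same small network twice, once from all-zero weights and once from small random weights, and compares the rows of W1 (one row per hidden unit). It assumes a tanh hidden layer, a sigmoid output and made-up data; the helper `train` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(W1, b1, W2, b2, X, Y, steps=50, lr=0.5):
    """Plain gradient descent on a tanh-hidden, sigmoid-output network."""
    m = X.shape[1]
    for _ in range(steps):
        A1 = np.tanh(W1 @ X + b1)
        A2 = sigmoid(W2 @ A1 + b2)
        dZ2 = A2 - Y
        dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)   # uses W2 before it is updated
        W2 = W2 - lr * (dZ2 @ A1.T) / m
        b2 = b2 - lr * np.sum(dZ2, axis=1, keepdims=True) / m
        W1 = W1 - lr * (dZ1 @ X.T) / m
        b1 = b1 - lr * np.sum(dZ1, axis=1, keepdims=True) / m
    return W1, b1, W2, b2

rng = np.random.default_rng(0)
n_x, n_h, m = 3, 4, 5
X = rng.standard_normal((n_x, m))            # hypothetical inputs
Y = np.array([[1.0, 0.0, 1.0, 1.0, 0.0]])    # hypothetical labels

# Zero initialization: every hidden unit starts (and stays) identical.
W1_zero, _, _, _ = train(np.zeros((n_h, n_x)), np.zeros((n_h, 1)),
                         np.zeros((1, n_h)), np.zeros((1, 1)), X, Y)

# Small random initialization breaks the symmetry.
W1_rand, _, _, _ = train(rng.standard_normal((n_h, n_x)) * 0.01, np.zeros((n_h, 1)),
                         rng.standard_normal((1, n_h)) * 0.01, np.zeros((1, 1)), X, Y)

print("zero init, rows identical:  ", np.allclose(W1_zero, W1_zero[0:1]))
print("random init, rows identical:", np.allclose(W1_rand, W1_rand[0:1]))
```

With tanh and all-zero weights the hidden weights never even leave zero, because both A1 and W2 are zero and so are their gradients. Only the output bias learns anything.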

# 5. Recap <a class="anchor" id="chapter5"></a>

- The tanh activation is almost always better than the sigmoid activation function for hidden units because the mean of its output is closer to zero, and so it centers the data, making learning simpler for the next layer.

- Sigmoid outputs a value between 0 and 1 which makes it a very good choice for binary classification. You can classify as 0 if the output is less than 0.5 and classify as 1 if the output is more than 0.5.

- Suppose you have built a neural network. You decide to initialize the weights and biases to be zero : 
    - Each neuron in the first hidden layer will perform the same computation. So even after multiple iterations of gradient descent each neuron in the layer will be computing the same thing as other neurons. 


- Logistic Regression doesn't have a hidden layer. If you initialize the weights to zeros, the first example x fed into the logistic regression gives z = 0 (so a prediction of 0.5), but the derivatives of the Logistic Regression depend on the input x (because there's no hidden layer), which is not zero. So at the second iteration, the weight values follow x's distribution and are different from each other if x is not a constant vector.


- You have built a network using the tanh activation for all the hidden units. You initialize the weights to relatively large values, using np.random.randn(..,..)*1000. What will happen? 
    - This will cause the inputs of the tanh to also be very large, thus causing gradients to be close to zero. The optimization algorithm will thus become slow.
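The last point can be illustrated numerically. The sketch below compares the tanh slope 1 - tanh(z)^2 for weights scaled by 0.01 and by 1000, using made-up inputs and layer sizes (3 features, 4 hidden units, 5 examples).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))        # hypothetical inputs: 3 features, 5 examples
W_base = rng.standard_normal((4, 3))   # unscaled weights for 4 hidden units

mean_slope = {}
for scale in (0.01, 1000):
    Z1 = (W_base * scale) @ X            # inputs of the tanh
    slope = 1 - np.tanh(Z1) ** 2         # derivative of tanh at Z1
    mean_slope[scale] = float(slope.mean())
    print(f"scale={scale}: mean |Z1| = {np.abs(Z1).mean():.3g}, "
          f"mean slope = {mean_slope[scale]:.3g}")
```

With the small scale the slopes stay close to 1, so gradients flow back through the hidden layer. With the large scale almost every unit is saturated and the slopes are essentially zero.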