## Neural networks overview

- In logistic regression: $x\rightarrow z\rightarrow a\rightarrow L(a,y)$
- In neural networks with one hidden layer: $x\rightarrow z_{1}\rightarrow a_{1}\rightarrow z_{2}\rightarrow a_{2}\rightarrow L(a_{2}, y)$
<img src="screenshot/7.PNG" style="width:600px;height:350px;">

## Neural network representation

- A neural network contains an input layer (layer 0), one or more hidden layers, and an output layer. Here we consider networks with one hidden layer. "Hidden" means that the values of that layer are not observed in the training set.
- The superscripts $^{[0]}$, $^{[1]}$, $^{[2]}$ denote the $0^{th}$, $1^{st}$, and $2^{nd}$ layers.
- The notations $_{1}^{[1]}$, $_{2}^{[1]}$ denote the first node (neuron) and the second node (neuron) of the first layer.
<img src="screenshot/8.PNG" style="width:600px;height:350px;">
<img src="screenshot/9.PNG" style="width:600px;height:350px;">
- Shapes of the variables and parameters
    - $w^{[1]}$ is the weight matrix of the hidden layer, and it has shape (\# of hidden neurons, $n_{x}$)
    - $b^{[1]}$ is the bias vector of the hidden layer, and it has shape (\# of hidden neurons, 1)
    - $z^{[1]}$ is the result of the equation $z^{[1]}=w^{[1]}x+b^{[1]}$, and it has shape (\# of hidden neurons, 1)
    - $a^{[1]}$ is the result of the equation $a^{[1]}=sigmoid(z^{[1]})$, and it has shape (\# of hidden neurons, 1)
    - $w^{[2]}$ is the weight matrix of the output layer, and it has shape (1, \# of hidden neurons)
    - $b^{[2]}$ is the bias of the output layer, and it has shape (1, 1)
    - $z^{[2]}$ is the result of the equation $z^{[2]}=w^{[2]}a^{[1]}+b^{[2]}$, and it has shape (1, 1)
    - $a^{[2]}$ is the result of the equation $a^{[2]}=sigmoid(z^{[2]})$, and it has shape (1, 1)
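
The shapes listed above can be checked with a short NumPy sketch for a single example. The sizes (`n_x = 3`, `n_h = 4`), the variable names, and the `0.01` scaling are assumptions chosen for illustration, not values from the notes.

```python
import numpy as np

n_x, n_h = 3, 4  # assumed sizes: 3 input features, 4 hidden neurons

rng = np.random.default_rng(0)
x = rng.standard_normal((n_x, 1))            # one training example as a column vector

W1 = rng.standard_normal((n_h, n_x)) * 0.01  # hidden-layer weights
b1 = np.zeros((n_h, 1))                      # hidden-layer bias
W2 = rng.standard_normal((1, n_h)) * 0.01    # output-layer weights
b2 = np.zeros((1, 1))                        # output-layer bias


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)

# Print every shape so it can be compared with the list above.
for name, arr in [("W1", W1), ("b1", b1), ("z1", z1), ("a1", a1),
                  ("W2", W2), ("b2", b2), ("z2", z2), ("a2", a2)]:
    print(name, arr.shape)
```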

## Computing a neural network's output

## Vectorizing across multiple examples

- Pseudocode for forward propagation in a 2-layer NN, looping over the $m$ training examples:
$$
\begin{aligned}
&for\ i=1\ to\ m:\\
&\quad z^{[1](i)}=w^{[1]}x^{(i)}+b^{[1]}\\
&\quad a^{[1](i)}=sigmoid(z^{[1](i)})\\
&\quad z^{[2](i)}=w^{[2]}a^{[1](i)}+b^{[2]}\\
&\quad a^{[2](i)}=sigmoid(z^{[2](i)})\\
\end{aligned}
$$

- Pseudocode for the vectorized forward propagation in a 2-layer NN. Here $x$, $z^{[l]}$, and $a^{[l]}$ are matrices whose columns are the individual training examples, so **the number of columns is always $m$**:
$$
\begin{aligned}
&z^{[1]}=w^{[1]}x+b^{[1]}\\
&a^{[1]}=sigmoid(z^{[1]})\\
&z^{[2]}=w^{[2]}a^{[1]}+b^{[2]}\\
&a^{[2]}=sigmoid(z^{[2]})\\
\end{aligned}
$$
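
As a rough illustration, the loop and the vectorized version can be written in NumPy and compared. The function names `forward_loop` and `forward_vectorized`, the layer sizes, and the random data are assumptions made for this sketch. Uppercase names denote matrices with one column per example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward_loop(X, W1, b1, W2, b2):
    """Forward propagation one example at a time."""
    m = X.shape[1]
    A2 = np.zeros((1, m))
    for i in range(m):
        x_i = X[:, i:i + 1]          # column i, shape (n_x, 1)
        z1 = W1 @ x_i + b1
        a1 = sigmoid(z1)
        z2 = W2 @ a1 + b2
        A2[:, i:i + 1] = sigmoid(z2)
    return A2


def forward_vectorized(X, W1, b1, W2, b2):
    """Forward propagation on all m examples at once."""
    Z1 = W1 @ X + b1                 # b1 is broadcast across the m columns
    A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + b2
    return sigmoid(Z2)


rng = np.random.default_rng(1)
n_x, n_h, m = 3, 4, 5                # assumed sizes
X = rng.standard_normal((n_x, m))
W1 = rng.standard_normal((n_h, n_x))
b1 = rng.standard_normal((n_h, 1))
W2 = rng.standard_normal((1, n_h))
b2 = rng.standard_normal((1, 1))

out_loop = forward_loop(X, W1, b1, W2, b2)
out_vec = forward_vectorized(X, W1, b1, W2, b2)
print(np.allclose(out_loop, out_vec))
```

Both functions compute the same values. The vectorized one replaces the Python loop with matrix products and relies on broadcasting to add the bias to every column.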

## Activation functions

### sigmoid() (use for output layer in the case with a binary outcome):
- Sigmoid function: $A=g(z)=\frac{1}{1+\exp(-z)}$
- Derivative: $g^{'}(z)=\frac{\exp(-z)}{(1+\exp(-z))^{2}}=g(z)*(1-g(z))$
- sigmoid() can lead to the vanishing gradient problem: when $|z|$ is large, the derivative is close to zero, so the updates become very small
- The range of the sigmoid() activation function is $(0,1)$
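
A minimal sketch of the sigmoid and its derivative, with a finite-difference check of the identity $g^{'}(z)=g(z)(1-g(z))$. The helper name `sigmoid_prime` and the test points are assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_prime(z):
    g = sigmoid(z)
    return g * (1.0 - g)


z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

# Central finite difference as an independent estimate of the slope.
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

print(sigmoid_prime(z))
print(np.allclose(sigmoid_prime(z), numeric, atol=1e-6))
```

The slope is largest at $z=0$ and becomes tiny for large $|z|$, which is the saturation effect mentioned above.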

### tanh() 
- tanh() function: $A=g(z)=\frac{\exp(z)-\exp(-z)}{\exp(z)+\exp(-z)}$
- Derivative: $g^{'}(z)=1-g(z)^{2}$
- The range of the tanh() activation function is $(-1,1)$
- It turns out that the tanh() activation usually works better than the sigmoid() activation function for hidden units because the mean of its output is closer to zero, and so it centers the data better for the next layer.
- The disadvantage of sigmoid/tanh: if the input is very negative or very positive, the slope is near zero, which slows down gradient descent (the vanishing gradient problem).
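
The centering and saturation points can be illustrated numerically. This sketch assumes zero-mean standard normal inputs, which is an assumption for illustration only. The helper name `tanh_prime` is also hypothetical.

```python
import numpy as np


def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2


rng = np.random.default_rng(0)
z = rng.standard_normal(10_000)          # assumed zero-mean pre-activations

sigmoid_mean = (1.0 / (1.0 + np.exp(-z))).mean()
tanh_mean = np.tanh(z).mean()

print("mean of sigmoid(z):", sigmoid_mean)   # close to 0.5
print("mean of tanh(z):   ", tanh_mean)      # close to 0
print("tanh slope at z=0: ", tanh_prime(0.0))
print("tanh slope at z=5: ", tanh_prime(5.0))
```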

### ReLU
- ReLU function: $A=\max(0,z)$, meaning that if $z$ is negative the slope is $0$ and if $z$ is positive the slope is $1$ (the output is linear in $z$).
- Derivative: $g^{'}(z)=\begin{cases}0 & \text{if } z<0\\1 & \text{if } z\ge0\end{cases}$ (the derivative is undefined at $z=0$; using $1$ there is a convention)
- For the case with a binary output, use the sigmoid() as the output activation and ReLU as activation functions for other layers.

### Leaky ReLU
- Leaky ReLU function: $A=\max(0.01z,z)$
- Derivative: $g^{'}(z)=\begin{cases}0.01 & \text{if } z<0\\1 & \text{if } z\ge0\end{cases}$
- Leaky ReLU activation function is a modified version of ReLU.
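
ReLU and Leaky ReLU, with the derivatives given above, can be sketched as follows. The function names and the `alpha` parameter are assumptions for illustration. The slope at $z=0$ follows the convention used in these notes.

```python
import numpy as np


def relu(z):
    return np.maximum(0.0, z)


def relu_prime(z):
    # Convention from the notes: slope 1 for z >= 0, slope 0 for z < 0.
    return np.where(z >= 0, 1.0, 0.0)


def leaky_relu(z, alpha=0.01):
    # Valid for 0 < alpha < 1: alpha*z exceeds z only when z is negative.
    return np.maximum(alpha * z, z)


def leaky_relu_prime(z, alpha=0.01):
    return np.where(z >= 0, 1.0, alpha)


z = np.array([-2.0, 0.0, 3.0])
print(relu(z), relu_prime(z))
print(leaky_relu(z), leaky_relu_prime(z))
```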

**There are no universal rules for choosing the hyperparameters of a NN (\# of hidden layers, \# of neurons in each hidden layer, learning rate, activation functions, etc.); in practice they are chosen by trying different settings.**

## Why do we need nonlinear activation functions?

- If we remove the activation function from our algorithm, this is equivalent to using a linear (identity) activation function.
- With linear activation functions, the network outputs only a linear combination of the input, no matter how many layers it has, so the hidden layers add no expressive power.
- A linear activation function is useful in one place: the output layer, when the output is a real number (regression problem). Even in this case, if the output value is non-negative, we could use ReLU instead.
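
The collapse to a single linear map can be checked numerically. In this sketch the sizes, the random parameters, and the names `W_eq` and `b_eq` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, m = 3, 4, 6                    # assumed sizes
X = rng.standard_normal((n_x, m))
W1 = rng.standard_normal((n_h, n_x))
b1 = rng.standard_normal((n_h, 1))
W2 = rng.standard_normal((1, n_h))
b2 = rng.standard_normal((1, 1))

# Two layers with the identity (linear) activation.
A2 = W2 @ (W1 @ X + b1) + b2

# One equivalent linear layer: W_eq = W2 W1, b_eq = W2 b1 + b2.
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
A_eq = W_eq @ X + b_eq

print(np.allclose(A2, A_eq))
```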

## Gradient descent for the neural networks

- Algorithm
$$
\begin{aligned}
&Repeat:\\
&\quad compute\ predictions\ (\hat{y}^{(i)},i=1,2,\dots,m)\\
&\quad get\ derivatives: dw^{[1]},db^{[1]},dw^{[2]},db^{[2]}\\
&\quad update:\\
&\quad\quad w^{[1]}=w^{[1]}-\alpha*dw^{[1]}\\
&\quad\quad b^{[1]}=b^{[1]}-\alpha*db^{[1]}\\
&\quad\quad w^{[2]}=w^{[2]}-\alpha*dw^{[2]}\\
&\quad\quad b^{[2]}=b^{[2]}-\alpha*db^{[2]}
\end{aligned}
$$
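where, for a single training example with a sigmoid output and cross-entropy loss ($g^{[1]}$ is the hidden-layer activation function and $\odot$ is the element-wise product),
$$
\begin{aligned}
&dz^{[2]}=a^{[2]}-y\\
&dw^{[2]}=dz^{[2]}a^{[1]T}\\
&db^{[2]}=dz^{[2]}\\
&dz^{[1]}=w^{[2]T}dz^{[2]}\odot g^{[1]'}(z^{[1]})\\
&dw^{[1]}=dz^{[1]}a^{[0]T}=dz^{[1]}x^{T}\\
&db^{[1]}=dz^{[1]}
\end{aligned}
$$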
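
The algorithm can be sketched in NumPy as below. This is a minimal sketch under stated assumptions, not a reference implementation: it uses tanh in the hidden layer, a sigmoid output with cross-entropy loss, gradients averaged over the $m$ examples (the vectorized counterpart of the per-example formulas), and a synthetic dataset. The function name `train` and all hyperparameter values are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train(X, Y, n_h=4, alpha=0.3, num_iters=1000, seed=0):
    """Gradient descent for a 1-hidden-layer NN (tanh hidden, sigmoid output)."""
    rng = np.random.default_rng(seed)
    n_x, m = X.shape
    W1 = rng.standard_normal((n_h, n_x)) * 0.01   # small random weights (assumed scale)
    b1 = np.zeros((n_h, 1))
    W2 = rng.standard_normal((1, n_h)) * 0.01
    b2 = np.zeros((1, 1))
    losses = []
    for _ in range(num_iters):
        # Forward propagation on all m examples.
        Z1 = W1 @ X + b1
        A1 = np.tanh(Z1)
        Z2 = W2 @ A1 + b2
        A2 = sigmoid(Z2)

        # Cross-entropy loss; eps guards against log(0).
        eps = 1e-12
        loss = -np.mean(Y * np.log(A2 + eps) + (1 - Y) * np.log(1 - A2 + eps))
        losses.append(loss)

        # Backward propagation, averaged over the m examples.
        dZ2 = A2 - Y
        dW2 = dZ2 @ A1.T / m
        db2 = dZ2.sum(axis=1, keepdims=True) / m
        dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)        # tanh'(z) = 1 - tanh(z)^2
        dW1 = dZ1 @ X.T / m
        db1 = dZ1.sum(axis=1, keepdims=True) / m

        # Parameter update.
        W1 -= alpha * dW1
        b1 -= alpha * db1
        W2 -= alpha * dW2
        b2 -= alpha * db2
    return (W1, b1, W2, b2), losses


# Synthetic binary problem (an assumption): label is 1 when x1 + x2 > 0.
rng = np.random.default_rng(42)
X = rng.standard_normal((2, 200))
Y = (X[0:1] + X[1:2] > 0).astype(float)           # shape (1, 200)

params, losses = train(X, Y)
print("first loss:", losses[0], "last loss:", losses[-1])
```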

<img src="screenshot/10.PNG" style="width:600px;height:350px;">

## Random initialization

- In logistic regression, it is not important to initialize the weights randomly (initializing them to zero works), while in a NN we have to initialize them randomly.
- If we initialize all the weights to zero in a NN, it won't work (though it is fine to initialize the biases to zero):
    - all hidden units will be completely identical (symmetric): they compute exactly the same function
    - on each gradient descent iteration, all the hidden units will receive the same update
- We want small initial weights because, with sigmoid() or tanh(), large weights make it likely that $z$ is very large in magnitude even at the very start of training. This saturates the tanh() or sigmoid() activation function and slows down learning. If the NN has no sigmoid() or tanh() activation functions, this is less of an issue.
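
A minimal sketch of small random initialization, plus a check of the symmetry problem with zero weights. The function name `init_params`, the `scale=0.01` factor, and the layer sizes are assumptions for illustration.

```python
import numpy as np


def init_params(n_x, n_h, n_y=1, scale=0.01, seed=0):
    """Small random weights, zero biases (scale is an assumed value)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((n_h, n_x)) * scale,
        "b1": np.zeros((n_h, 1)),
        "W2": rng.standard_normal((n_y, n_h)) * scale,
        "b2": np.zeros((n_y, 1)),
    }


params = init_params(n_x=3, n_h=4)
X = np.random.default_rng(1).standard_normal((3, 5))

# Zero weights: every hidden unit (row) produces exactly the same output.
A1_zero = np.tanh(np.zeros((4, 3)) @ X + np.zeros((4, 1)))

# Small random weights: the hidden units differ, so symmetry is broken.
A1_rand = np.tanh(params["W1"] @ X + params["b1"])

print("zero init, rows identical:  ", np.allclose(A1_zero, A1_zero[0]))
print("random init, rows identical:", np.allclose(A1_rand[0], A1_rand[1]))
```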