<a href="https://colab.research.google.com/github/cyrus2281/notes/blob/main/MachineLearning/Tensorflow/LazyProgrammer/02_Tensorflow_Artificial_Neural_Networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Content

>[Content](#scrollTo=66dOmkdh_3Bx)

>[Artificial Neural Networks](#scrollTo=YiVkndZ5Up77)

>>[Feedforward Neural Networks](#scrollTo=2gY3gD-9ZwPi)

>>[Activation Functions](#scrollTo=cMpmO42gcjJe)

>>>[Identity](#scrollTo=zwnoUVqU-2jN)

>>>[Sigmoid](#scrollTo=Z-Oyiikzx7EF)

>>>>[Standardization](#scrollTo=7xO7_bnsxEI5)

>>>[Tanh](#scrollTo=SbGf15PRx9Pa)

>>>>[The Vanishing Gradient Problem](#scrollTo=BJNIGlf-yj6d)

>>>[ReLU](#scrollTo=oKkr1zJFz0jZ)

>>>>[Dead Neuron Problem](#scrollTo=98grfxfc2T-A)

>>>[LReLU](#scrollTo=SwJuY83B2yom)

>>>[ELU](#scrollTo=rFwSAjlK4T87)

>>>[Softplus](#scrollTo=IYJ9QMSE5HWO)

>>>[BRU](#scrollTo=N1r-a2oC7E8H)

>>>>[Multiclass Classification](#scrollTo=4bhxjQMB7-GO)

>>>[Softmax](#scrollTo=vJeAKLLx8_5-)



# Artificial Neural Networks

## Feedforward Neural Networks

A feedforward neural network consists of multiple layers of interconnected neurons (logistic regressions). The input is on the left, the output is on the right, and the signal travels from left to right. Hence it is called a "feedforward" neural network.

\

Different neurons look for different features.

\

The same inputs can be fed to multiple different neurons, each calculating something different (more neurons per layer).

Neurons in one layer can act as inputs to another layer.

A line: $ax+b$

A neuron: $\sigma(w^Tx+b)$

If there are $M$ neurons, and we call the output of each one $z_j$:

$
z_j = \sigma(w^T_jx+b_j), \text{ for } j = 1\dots M \\
$

Vectorize the neurons (Single Layer):

$$
z = \sigma(W^Tx+b)
$$

- $z$ is a vector of size M
- $x$ is a vector of size D
- $W$ is a matrix of size D×M
- $b$ is a vector of size M
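The vectorized layer above can be sketched in plain Python. This is only an illustration: the helper names `sigmoid` and `dense_layer` and the example numbers are made up for this note, and a real implementation would use a library such as NumPy or TensorFlow instead of nested lists.

```python
import math

def sigmoid(a):
    # Logistic function 1 / (1 + exp(-a)); fine for moderate values of a.
    return 1.0 / (1.0 + math.exp(-a))

def dense_layer(x, W, b):
    """Compute z = sigmoid(W^T x + b) for a single layer.

    x: list of D inputs, W: D x M nested list, b: list of M biases.
    Returns a list of M activations.
    """
    D, M = len(W), len(b)
    return [
        sigmoid(sum(W[i][j] * x[i] for i in range(D)) + b[j])
        for j in range(M)
    ]

# Hypothetical example values: D = 2 inputs, M = 3 neurons.
x = [1.0, 2.0]
W = [[0.5, -1.0, 0.0],
     [0.25, 0.5, 0.0]]
b = [0.0, 0.0, 0.0]

z = dense_layer(x, W, b)
print(z)  # one activation per neuron, each between 0 and 1
```

Each column of `W` holds the weights $w_j$ of one neuron, which is why `W` has size D×M and the result has size M.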

\

Vectorize the neurons (Multiple Layer):

We denote the layer index with a superscript:

$
z^{(1)} = \sigma(W^{(1)T}x+b^{(1)}) \\
z^{(2)} = \sigma(W^{(2)T}z^{(1)}+b^{(2)}) \\
z^{(3)} = \sigma(W^{(3)T}z^{(2)}+b^{(3)}) \\
$

$$
p(y=1 | x) = \sigma(W^{(L)T}z^{(L-1)}+b^{(L)}) \\
$$

- $L$ is the number of layers





If we don't need the final sigmoid (e.g., when predicting real-valued rather than binary targets), we can simply remove it:

$$
\hat y = W^{(L)T}z^{(L-1)}+b^{(L)} \\
$$

(This is just a linear regression on the output of the last hidden layer, $z^{(L-1)}$.)
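A minimal sketch of the multi-layer forward pass, with the final sigmoid made optional. The function names (`affine`, `forward`), the `final_sigmoid` flag, and the weights are assumptions chosen for illustration, not part of any library.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def affine(x, W, b):
    # W^T x + b, where W is a (len(x) x len(b)) nested list.
    return [
        sum(W[i][j] * x[i] for i in range(len(x))) + b[j]
        for j in range(len(b))
    ]

def forward(x, layers, final_sigmoid=True):
    """Feed x through a list of (W, b) pairs.

    Hidden layers always use the sigmoid. The last layer uses it only if
    final_sigmoid is True (binary classification); otherwise the raw
    linear output is returned (regression).
    """
    z = x
    for W, b in layers[:-1]:
        z = [sigmoid(a) for a in affine(z, W, b)]
    W, b = layers[-1]
    a = affine(z, W, b)
    return [sigmoid(v) for v in a] if final_sigmoid else a

# Hypothetical weights: 2 inputs -> 3 hidden units -> 1 output.
layers = [
    ([[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]], [0.0, 0.1, -0.1]),
    ([[0.7], [-0.8], [0.9]], [0.05]),
]

p = forward([1.0, -1.0], layers)                           # p(y=1 | x)
y_hat = forward([1.0, -1.0], layers, final_sigmoid=False)  # regression output
print(p, y_hat)
```

The only difference between the two calls is the last step, which matches the two output equations above.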

## Activation Functions


### Identity

The identity function simply returns its input.

$$
f(x) = x
$$

It is used as the output activation for regression models.

### Sigmoid

$$
\sigma(a) = \frac{1}{1+\exp(-a)}
$$

- Maps any real input to the range (0, 1)
- Non-linear

![sigmoid](https://miro.medium.com/v2/resize:fit:640/format:webp/1*Xu7B5y9gp0iL5ooBj7LtWw.png)
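As a rough sketch, the sigmoid can be written directly from its formula. The branch on the sign of `a` is an extra precaution added here (not part of the formula above) so that `math.exp` is never called with a large positive argument.

```python
import math

def sigmoid(a):
    # Mathematically both branches equal 1 / (1 + exp(-a)).
    # For negative a we use exp(a) / (1 + exp(a)) to avoid overflow.
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

for a in (-5.0, 0.0, 5.0):
    print(a, sigmoid(a))
```

Large negative inputs map close to 0, large positive inputs map close to 1, and an input of 0 maps to exactly 0.5.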



#### Standardization

We don't want one input in the range 1 to 5 million and another in the range 0 to 0.0001. We prefer inputs that are centered around 0 and have approximately the same range.

The sigmoid output lies between 0 and 1, with its center at 0.5. Its output therefore can never be centered around 0.

This matters for "uniformity": the output of the sigmoid in one layer is the input to the next layer, so we would like it to be centered around 0 as well.
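One common way to get inputs centered around 0 with a similar range is to subtract the mean and divide by the standard deviation of each feature. The sketch below assumes that approach; the `standardize` helper and the sample values are hypothetical.

```python
import statistics

def standardize(values):
    # Subtract the mean and divide by the (population) standard deviation.
    # Assumes the values are not all identical, otherwise std would be 0.
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Two hypothetical features on wildly different scales.
large = [1.0, 1_000_000.0, 5_000_000.0]
small = [0.0, 0.00002, 0.0001]

large_std = standardize(large)
small_std = standardize(small)
print(large_std)
print(small_std)
```

After this step both features have mean 0 and standard deviation 1, regardless of their original scale.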



### Tanh

The solution to this issue is another activation function that is similar to the sigmoid but centered around zero: the **Hyperbolic Tangent (tanh)**.

$$
\tanh(a) = \frac{\exp(2a)-1}{\exp(2a)+1} \\
$$

![tanh](https://images.squarespace-cdn.com/content/v1/5acbdd3a25bf024c12f4c8b4/1524687495762-MQLVJGP4I57NT34XXTF4/TanhFunction.jpg?format=1500w)
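A small sketch that evaluates the formula above and compares it with Python's built-in `math.tanh`. The function name `tanh_from_formula` is made up for this note; in practice `math.tanh` is the better choice, since the literal formula overflows for large inputs.

```python
import math

def tanh_from_formula(a):
    # Direct translation of (exp(2a) - 1) / (exp(2a) + 1).
    # Only safe for moderate a, because exp(2a) overflows for large a.
    e2a = math.exp(2 * a)
    return (e2a - 1) / (e2a + 1)

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

for a in (-2.0, 0.0, 2.0):
    # tanh is a rescaled, shifted sigmoid: tanh(a) = 2 * sigmoid(2a) - 1
    print(a, tanh_from_formula(a), math.tanh(a), 2 * sigmoid(2 * a) - 1)
```

The outputs lie between -1 and 1 and are symmetric around 0, which is the property the sigmoid lacks.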

#### The Vanishing Gradient Problem

Gradient through multiple layers (chain rule):

$$
\frac{\partial J}{\partial W^{(1)}} =
\frac{\partial J}{\partial z^{(L)}}
\frac{\partial z^{(L)}}{\partial z^{(L-1)}} \cdots
\frac{\partial z^{(2)}}{\partial z^{(1)}}
\frac{\partial z^{(1)}}{\partial W^{(1)}}
$$

The output is a nested composition of sigmoids:

$$
\sigma(\dots \sigma( \dots (\sigma \dots(\dots))))
$$

We end up multiplying by the derivative of the sigmoid over and over again.


The derivative of the sigmoid is a very small number! Its maximum value is only 0.25.


![Derivative of sigmoid](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*6A3A_rt4YmumHusvTvVTxw.png)


Multiplying many small numbers only results in an even smaller number. E.g. $0.25^5 \approx 0.001$

As a result, *the further back we go in a neural network, the smaller the gradient becomes*. This is known as the **Vanishing Gradient Problem**.


The training algorithm takes small steps in the direction given by the gradient. If the gradient is nearly zero, the update to the weights is also nearly zero. The end result is that the weights close to the input of the neural network are barely trained at all.
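The arithmetic can be checked with a few lines of Python. This is a simplified sketch: it uses the identity $\sigma'(a) = \sigma(a)(1-\sigma(a))$ and assumes every layer contributes the best-case factor of 0.25, ignoring the weight matrices that also appear in the full chain rule.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_grad(a):
    # Derivative of the sigmoid: sigma(a) * (1 - sigma(a)).
    s = sigmoid(a)
    return s * (1.0 - s)

# The derivative peaks at a = 0, where it equals 0.25.
max_grad = sigmoid_grad(0.0)

# Best-case product of sigmoid derivatives across several layers.
for depth in (1, 5, 10, 20):
    print(depth, max_grad ** depth)
```

Even in this best case the product shrinks geometrically with depth, and away from $a = 0$ the derivative is smaller still.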

### ReLU

The solution was simple: don't use activation functions that have vanishing gradients.

For example, the **Rectified Linear Unit (ReLU)**.

$$
R(z) = \max(0, z) \\
$$

![relu](https://miro.medium.com/v2/resize:fit:1100/format:webp/1*DfMRHwxY1gyyDmrIAd-gjQ.png)



#### Dead Neuron Problem

*Problem*: the ReLU doesn't have a "vanishing" gradient, but the gradient in the left half has already vanished (it is exactly zero)!

This phenomenon is known as the "**dead neuron**" problem.

The fact that the right side does not vanish seems to be enough for the majority of deep learning experiments.
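A minimal sketch of the ReLU and its gradient, to make the two halves concrete. The gradient at exactly 0 is undefined; returning 0 there is a convention assumed here.

```python
def relu(z):
    # R(z) = max(0, z)
    return max(0.0, z)

def relu_grad(z):
    # Slope is 1 on the right half and 0 on the left half.
    # At z == 0 the derivative is undefined; we assume 0 by convention.
    return 1.0 if z > 0 else 0.0

for z in (-3.0, -0.5, 0.5, 3.0):
    print(z, relu(z), relu_grad(z))
```

For any negative input the gradient is 0, so a neuron whose input is negative for every training sample receives no weight updates.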


### LReLU

Leaky ReLU (LReLU) is one solution to the dead neuron problem. It has a small positive slope for negative inputs.

$$
f(x) = \left\{\begin{matrix}
x & x \ge 0
\\
\alpha x & x < 0
\end{matrix}\right.
$$
- α is a small number like 0.1

- Slope is always positive
- It's a non-linear function

![lrelu](https://www.researchgate.net/publication/340644173/figure/fig4/AS:880423093686272@1586920631085/5a-Graph-of-the-LReLU-function-5b-Graph-of-gradient-of-LReLU-function.ppm)
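A short sketch of the leaky ReLU and its gradient, using the example value $\alpha = 0.1$ from above as the default. The function names are chosen for this note only.

```python
def leaky_relu(x, alpha=0.1):
    # x for x >= 0, alpha * x for x < 0
    return x if x >= 0 else alpha * x

def leaky_relu_grad(x, alpha=0.1):
    # Slope is 1 on the right and alpha (small but non-zero) on the left.
    return 1.0 if x >= 0 else alpha

for x in (-2.0, -0.5, 0.5, 2.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```

Because the gradient on the left is `alpha` rather than 0, a neuron with negative input still receives a (small) update.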


### ELU

The exponential linear unit (ELU) is another solution to the dead neuron problem. Its value decreases smoothly on the left side instead of being cut off at zero.

The authors claim it speeds up learning and leads to higher accuracy.

Negative values are possible, so the mean activation can be zero (unlike ReLU).


$$
f(x) = \left\{\begin{matrix}
x & x > 0
\\
a (e^x -1) & x \le 0
\end{matrix}\right. \\
$$

![elu](https://ml-cheatsheet.readthedocs.io/en/latest/_images/elu.png)
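The ELU formula can be sketched as follows. The default `a=1.0` is an assumption made for this example; `math.expm1(x)` computes $e^x - 1$ accurately for small `x`.

```python
import math

def elu(x, a=1.0):
    # x for x > 0, a * (exp(x) - 1) for x <= 0
    return x if x > 0 else a * math.expm1(x)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, elu(x))
```

On the left side the output approaches `-a` instead of staying at 0, which is what allows the mean activation to be close to zero.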

### Softplus

Softplus is another option that is very similar to ReLU: because it takes the log of an exponential, it looks almost linear when the input is reasonably large.

$$
f(x) = \log(1 + e^x) \\
$$

![softplus](https://www.researchgate.net/profile/Hussam-Lawen/publication/336602359/figure/fig2/AS:814832592908288@1571282637278/The-Softplus-function-ln1-exp-compared-to-max0.ppm)

- Note: both softplus and ELU have vanishing gradients on the left - but we already know it's not too problematic because ReLU works.

- Softplus and ReLU are in the range $0\dots\infty$
  - they can't be centered around 0
  - Does it matter?
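A small sketch comparing softplus with ReLU. The rewritten form `max(x, 0) + log1p(exp(-|x|))` is mathematically equal to $\log(1 + e^x)$; it is used here as an assumption-free way to avoid overflow in `math.exp` for large inputs.

```python
import math

def softplus(x):
    # Equivalent to log(1 + exp(x)), rearranged so exp never overflows.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def relu(x):
    return max(0.0, x)

for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(x, softplus(x), relu(x))
```

For large positive inputs the two functions are nearly identical, and for large negative inputs softplus approaches 0 from above, so its gradient vanishes on the left.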


### BRU

Bionodal Root Units (BRUs) are activation functions that are modeled more closely on real neurons.

More at: https://arxiv.org/pdf/1804.11237.pdf



#### Multiclass Classification

If we have $K$ possible outcomes, then we should have $K$ output nodes, i.e., $a^{(L)}$ is a vector of size $K$.

$$
a^{(L)} = W^{(L)T}z^{(L-1)}+b^{(L)} \\
$$

So we need a probability distribution over $K$ distinct values.
- Each probability must be non-negative (>= 0), with an upper limit of 1 (<= 1)
- All probabilities must sum to 1

Requirement 1:
$$
p(y=k|x) \ge 0
$$

Requirement 2:
$$
\sum_{k=1}^K p(y=k|x) = 1
$$







### Softmax

Softmax is a function that satisfies both requirements.

$$
p(y= k|x) =
\frac{\exp(a_k)}{\sum_{j=1}^K\exp(a_j)}
$$
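A minimal sketch of softmax in plain Python. Subtracting the maximum activation before exponentiating is a standard numerical-stability trick that is not part of the formula above; it cancels in the ratio, so the result is unchanged. The input values are hypothetical.

```python
import math

def softmax(a):
    """Turn a list of K activations into K probabilities."""
    m = max(a)                            # shift for numerical stability
    exps = [math.exp(v - m) for v in a]   # each term is > 0 mathematically
    total = sum(exps)
    return [e / total for e in exps]      # normalize so the sum is 1

a = [2.0, 1.0, 0.1]   # hypothetical output activations for K = 3 classes
p = softmax(a)
print(p, sum(p))
```

Every entry of `p` is non-negative and the entries sum to 1, and the largest activation receives the largest probability.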