# Introduction
Neural networks, also known as artificial neural networks (ANNs) or simulated neural networks (SNNs), are a subset of machine learning and are at the heart of deep learning algorithms. Their name and structure are inspired by the human brain, mimicking the way that biological neurons signal to one another.

# Activation Functions


## What is an Activation Function?
The concept of activation functions in the neural network is inspired by the biological neurons of the human brain. In the biological brain, neurons are fired or activated based on certain inputs from their previous connected neurons. The entire brain is a complex network of these biological neurons that are activated in a complex manner and help the functioning of the entire body.<br><br>
In an artificial neural network, mathematical units known as artificial neurons are connected to each other. These units are fired using activation functions, which are simply mathematical functions applied to each neuron's input.<br><br>
The diagram below illustrates this concept and compares the biological neuron with the artificial neuron.
<center>
<img src="https://raw.githubusercontent.com/fabnancyuhp/DEEP-LEARNING/main/IMAGE/DENSENN/Activation_Function.png" width="50%" />
</center>

## ReLU Activation Function
The Rectified Linear Unit has become very popular in the last few years. It computes the function $f(x)=\max(0,x)$. In other words, the activation is simply thresholded at zero.
<center>
<img src="https://raw.githubusercontent.com/fabnancyuhp/DEEP-LEARNING/main/IMAGE/DENSENN/relu.png" width="20%" />
</center>

### Leaky ReLU Activation Function
Leaky ReLUs are one attempt to fix the “dying ReLU” problem, where a neuron whose input stays negative always outputs zero and receives a zero gradient. Instead of the function being zero when $x < 0$, a leaky ReLU has a small positive slope $\alpha$ (of 0.01, or so) in that region. That is, the function computes $f(x)=\alpha x\mathbb{1}_{x<0}(x)+x\mathbb{1}_{x\geq 0}(x)$, where $\alpha$ is a small constant.
<center>
<img src="https://raw.githubusercontent.com/fabnancyuhp/DEEP-LEARNING/main/IMAGE/DENSENN/leakyrelu.png" width="20%" />
</center>

## Sigmoid Activation Function
The sigmoid activation function has the mathematical form $\sigma(x)=\frac{1}{1+\exp(-x)}$. It takes a real-valued number and “squashes” it into the range between 0 and 1. In particular, large negative numbers map to values close to 0 and large positive numbers map to values close to 1.
<center>
<img src="https://raw.githubusercontent.com/fabnancyuhp/DEEP-LEARNING/main/IMAGE/DENSENN/sigmoid.png" width="20%" />
</center>

## Tanh Activation Function
The Tanh activation function squashes a real-valued number to the range $[-1, 1]$. Like the sigmoid neuron, its activations saturate, but unlike the sigmoid neuron its output is zero-centered. Therefore, in practice the tanh non-linearity is usually preferred to the sigmoid non-linearity. Also note that the tanh neuron is simply a scaled sigmoid neuron; in particular, the following holds: $\tanh(x) =2\sigma(2x)-1=\frac{1-\exp(-2x)}{1+\exp(-2x)}$.
<center>
<img src="https://raw.githubusercontent.com/fabnancyuhp/DEEP-LEARNING/main/IMAGE/DENSENN/tanh.png" width="20%" />
</center>


## SoftMax Activation Function
The softmax function produces a probability distribution: a vector whose values lie in $(0,1)$ and sum to 1. Given an n-sized vector $x=(x_{1},\dots,x_{n})$, a softmax layer outputs an n-sized probability vector whose ith component is 
$$\frac{\exp(x_{i})}{\sum_{j=1}^{n}\exp(x_{j})} $$
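
The activation functions of this section can be sketched in a few lines of NumPy. This is a minimal illustration rather than a library implementation: the function names and the default `alpha=0.01` are choices made here, and the softmax subtracts the maximum before exponentiating, a common trick for numerical stability that does not change the result.

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # slope alpha for x < 0, identity for x >= 0
    return np.where(x < 0, alpha * x, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # tanh written as a scaled sigmoid: 2 * sigmoid(2x) - 1
    return 2.0 * sigmoid(2.0 * x) - 1.0

def softmax(x):
    # subtracting max(x) avoids overflow and leaves the output unchanged
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), leaky_relu(x), sigmoid(x), tanh(x), softmax(x))
```

The `tanh` function above agrees with `np.tanh`, which illustrates the identity $\tanh(x)=2\sigma(2x)-1$.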

# Neural Network architectures
This section is inspired by the following website: https://cs231n.github.io/neural-networks-1/.<br><br>
Neural networks are built with a layer-wise organization. 
**A neural network is a graph of neurons**. Neural networks are modeled as collections of neurons that are connected in an acyclic graph. In other words, the outputs of some neurons can become inputs to other neurons. Cycles are not allowed since that would imply an infinite loop in the forward pass of a network. Instead of amorphous blobs of connected neurons, neural network models are often organized into distinct layers of neurons. For regular neural networks, the most common layer type is the **fully-connected layer**, in which neurons between two adjacent layers are fully pairwise connected, but neurons within a single layer share no connections. Below are two example neural network topologies that use a stack of fully-connected layers:
<div class="fig figcenter fighighlight">
  <img src="https://raw.githubusercontent.com/fabnancyuhp/DEEP-LEARNING/main/IMAGE/DENSENN/neural_net.jpeg" width="40%" />
  <img src="https://raw.githubusercontent.com/fabnancyuhp/DEEP-LEARNING/main/IMAGE/DENSENN/neural_net2.jpeg" width="50%" style="border-left: 1px solid black;" />
  <div class="figcaption"><b>Left:</b> A 2-layer Neural Network (one hidden layer of 4 neurons (or units) and one output layer with 2 neurons), and three inputs. <b>Right:</b> A 3-layer neural network with three inputs, two hidden layers of 4 neurons each and one output layer. Notice that in both cases there are connections between neurons across layers, but not within a layer.</div>
</div>

**Naming conventions.** Notice that when we say N-layer neural network, we do not count the input layer. Therefore, a single-layer neural network describes a network with no hidden layers (input directly mapped to output). In that sense, you can sometimes hear people say that logistic regression or SVMs are simply a special case of single-layer neural networks. You may also hear these networks interchangeably referred to as “Artificial Neural Networks” (ANN), “Multi-Layer Perceptrons” (MLP) or “Dense neural networks”. Many people do not like the analogies between neural networks and real brains and prefer to refer to neurons as units.<br><br>
**Output layer.** Unlike all other layers in a neural network, the output layer neurons most commonly do not have an activation function (or you can think of them as having a linear identity activation function). This is because the last output layer is usually taken to represent the class scores (e.g. in classification), which are arbitrary real-valued numbers, or some kind of real-valued target (e.g. in regression).<br><br>
**Sizing neural networks.** The two metrics that people commonly use to measure the size of neural networks are the number of neurons, or more commonly the number of parameters. Working with the two example networks in the above picture: 
* The first network (left) has 4 + 2 = 6 neurons (not counting the inputs), [3 x 4] + [4 x 2] = 20 weights and 4 + 2 = 6 biases, for a total of 26 learnable parameters.
* The second network (right) has 4 + 4 + 1 = 9 neurons, [3 x 4] + [4 x 4] + [4 x 1] = 12 + 16 + 4 = 32 weights and 4 + 4 + 1 = 9 biases, for a total of 41 learnable parameters.

Each of the small circles in the figures is a neuron that applies one of the activation functions discussed in the previous section.<br>
Each arrow between two layers corresponds to a weight.
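
The parameter counts above can be checked with a short helper. This is only a sketch: the function name `count_parameters` and the convention of describing a network by its list of layer sizes (inputs first) are assumptions made for this example.

```python
def count_parameters(layer_sizes):
    """Return (weights, biases, total) for a stack of fully-connected layers.

    layer_sizes lists the number of units per layer, starting with the inputs,
    e.g. [3, 4, 2] for the left network of the figure.
    """
    pairs = zip(layer_sizes[:-1], layer_sizes[1:])
    weights = sum(n_in * n_out for n_in, n_out in pairs)
    biases = sum(layer_sizes[1:])  # one bias per neuron, inputs excluded
    return weights, biases, weights + biases

left = count_parameters([3, 4, 2])
right = count_parameters([3, 4, 4, 1])
print(left)   # (20, 6, 26)
print(right)  # (32, 9, 41)
```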

# Forward propagation


We have an n-sized training sample $(x^{(1)},\dots,x^{(n)})$, which we also call a batch; $n$ is called the batch size. $x^{(i)}\in \mathbb{R}^{q}$ refers to the ith sample. 
* The input of the neural network is a vector of $\mathbb{R}^{q}$. 
* The output of the neural network is a vector of $\mathbb{R}^{p}$.
* We use the following notations: $n^{[0]}$ is the number of neurons of the input layer, $n^{[1]}$ is the number of neurons of the first hidden layer,..., $n^{[m-1]}$ is the number of neurons of hidden layer m-1, and $n^{[m]}$ is the number of neurons of the output layer. In fact, $n^{[0]}=q$ and $n^{[m]}=p$.
* The activation function of the jth layer is $\sigma^{[j]}$. The input layer has no activation function.
* $W^{[i]}$ is the weight matrix of layer i, with $W^{[i]}\in \mathbb{R}^{n^{[i]}\times n^{[i-1]}}$. The number of rows of $W^{[i]}$ is equal to the number of neurons of the current layer. The number of columns of $W^{[i]}$ is equal to the number of neurons of the previous layer. There is no weight matrix for the input layer.
* $b^{[i]}$ refers to the biases of layer i, with $b^{[i]}\in \mathbb{R}^{n^{[i]}}$. 
* $a^{[i]}$ refers to the activation values of layer i. The $a^{[i]}$'s of the hidden layers are also called hidden activations.

Below, we describe the forward propagation of an ANN. In other words, we explain how the $a^{[i]}$'s are computed.

We have $a^{[0]}=x$ where $x$ is the input vector. Then, we can compute 
* $z^{[1]}=W^{[1]}x+b^{[1]}$ and $a^{[1]}=\sigma^{[1]}(z^{[1]})=[\sigma^{[1]}(z^{[1]}_{1}),\dots,\sigma^{[1]}(z^{[1]}_{n^{[1]}})]^{T}$. 
* $z^{[2]}=W^{[2]}a^{[1]}+b^{[2]}$ and $a^{[2]}=\sigma^{[2]}(z^{[2]})=[\sigma^{[2]}(z^{[2]}_{1}),\dots,\sigma^{[2]}(z^{[2]}_{n^{[2]}})]^{T}$.
* $z^{[i]}=W^{[i]}a^{[i-1]}+b^{[i]}$ and $a^{[i]}=\sigma^{[i]}(z^{[i]})=[\sigma^{[i]}(z^{[i]}_{1}),\dots,\sigma^{[i]}(z^{[i]}_{n^{[i]}})]^{T}$.
* ...
* $z^{[m]}=W^{[m]}a^{[m-1]}+b^{[m]}$ and $a^{[m]}=\sigma^{[m]}(z^{[m]})=[\sigma^{[m]}(z^{[m]}_{1}),\dots,\sigma^{[m]}(z^{[m]}_{n^{[m]}})]^{T}$.
* The output vector is $\hat{y}=a^{[m]}=\sigma^{[m]}(z^{[m]})$. 
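
The recursion $z^{[i]}=W^{[i]}a^{[i-1]}+b^{[i]}$, $a^{[i]}=\sigma^{[i]}(z^{[i]})$ can be sketched in NumPy as follows. The layer sizes, the random weights and the choice of ReLU for the hidden layer with an identity output are assumptions made only for this illustration.

```python
import numpy as np

def forward(x, weights, biases, activations):
    """Forward pass: a^[0] = x, then a^[i] = sigma^[i](W^[i] a^[i-1] + b^[i])."""
    a = x
    for W, b, sigma in zip(weights, biases, activations):
        z = W @ a + b
        a = sigma(z)
    return a  # a^[m], the output vector y_hat

def relu(z):
    return np.maximum(0.0, z)

def identity(z):
    return z

rng = np.random.default_rng(0)
layer_sizes = [3, 4, 2]  # n^[0] = q = 3, n^[1] = 4, n^[2] = p = 2
# W^[i] has n^[i] rows and n^[i-1] columns
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

x = rng.normal(size=3)
y_hat = forward(x, weights, biases, [relu, identity])
print(y_hat.shape)  # (2,)
```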

$\hat{y}^{(i)}$ denotes the neural network output corresponding to the input $x^{(i)}$. For all i, $\hat{y}^{(i)}$ depends on the parameters $W^{[1]},\dots,W^{[m]},b^{[1]},\dots,b^{[m]}.$ 

We introduce the loss function $\mathcal{L}$, which allows us to measure the difference between $\hat{y}^{(i)}$ and the target $y^{(i)}$ for all i. The cost function of the overall training sample is given by
$$C(W,b)=\frac{1}{n}\sum_{i=1}^{n}\mathcal{L}(\hat{y}^{(i)},y^{(i)}) $$ 

When we deal with a classification problem, the loss function can be chosen from the following:
* categorical_crossentropy: 
<center>
<img src="https://raw.githubusercontent.com/fabnancyuhp/DEEP-LEARNING/main/IMAGE/DENSENN/loss1.png" width="50%" style="width:600px;margin:0-300px;" />
</center>

 where p is the output size and $\hat{y}^{(i)}$ is a probability vector.
* binary_crossentropy (The target can contain multiple independent binary labels): 
<center>
<img src="https://raw.githubusercontent.com/fabnancyuhp/DEEP-LEARNING/main/IMAGE/DENSENN/loss2.png" width="90%"  />
</center>
where p is the output size and each $\hat{y}^{(i)}_{k}$ is a probability.

When we deal with a regression problem, the loss function can be chosen from the following:
* Mean squared error: $\mathcal{L}(\hat{y}^{(i)},y^{(i)})=\frac{1}{p}\sum_{k=1}^{p}(y^{(i)}_{k}-\hat{y}^{(i)}_{k})^{2}$
* Mean absolute error: $\mathcal{L}(\hat{y}^{(i)},y^{(i)})=\frac{1}{p}\sum_{k=1}^{p}\lvert y^{(i)}_{k}-\hat{y}^{(i)}_{k}\rvert$ where p is the output size.
* Mean squared logarithmic error: $\mathcal{L}(\hat{y}^{(i)},y^{(i)})=\frac{1}{p}\sum_{k=1}^{p}\left(\log( y^{(i)}_{k}+1)-\log(\hat{y}^{(i)}_{k}+1)\right)^{2}$ where p is the output size.
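
The losses listed above can be sketched in NumPy for a single sample. The three regression losses follow the formulas of this section directly. The two cross-entropy formulas appear only as images here, so the versions below are the standard definitions and should be read as an assumption about what the images show; the clipping constant `eps` is likewise an illustrative choice to avoid `log(0)`.

```python
import numpy as np

def mean_squared_error(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mean_absolute_error(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mean_squared_log_error(y, y_hat):
    return np.mean((np.log(y + 1.0) - np.log(y_hat + 1.0)) ** 2)

def categorical_crossentropy(y, y_hat, eps=1e-12):
    # y is a one-hot vector, y_hat a probability vector (assumed standard form)
    return -np.sum(y * np.log(np.clip(y_hat, eps, 1.0)))

def binary_crossentropy(y, y_hat, eps=1e-12):
    # each y_k is an independent 0/1 label, each y_hat_k a probability
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

def cost(loss, Y, Y_hat):
    # C(W, b): average of the per-sample losses over the training sample
    return np.mean([loss(y, y_hat) for y, y_hat in zip(Y, Y_hat)])

y = np.array([1.0, 2.0])
y_hat = np.array([1.0, 4.0])
print(mean_squared_error(y, y_hat))   # 2.0
print(mean_absolute_error(y, y_hat))  # 1.0
```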

Generally, we add a regularization term to the cost function:
* $L^{2}$ regularization : $J(W,b)= C(W,b)+\frac{\lambda}{2m_{entries}}\|W \|^{2}=\frac{1}{n}\sum_{i=1}^{n}\mathcal{L}(\hat{y}^{(i)},y^{(i)})+\frac{\lambda}{2m_{entries}}\|W \|^{2}$, where $\|W \|^{2}$ is the sum of the squared weights. 
* $L^{1}$ regularization : $J(W,b)= C(W,b)+\frac{\lambda}{2m_{entries}}\|W \|_{1}=\frac{1}{n}\sum_{i=1}^{n}\mathcal{L}(\hat{y}^{(i)},y^{(i)})+\frac{\lambda}{2m_{entries}}\|W \|_{1}$, where $\|W \|_{1}$ is the sum of the absolute values of the weights.

In the previous equations, $\lambda$ is the regularization parameter and $m_{entries}$ is the number of weights, that is, the total count of the elements of the weight matrices $W=\{W^{[1]},\dots,W^{[m]} \}$.<br>

We want to approximate $W^{opt},b^{opt}=\underset{W,b}{\mathrm{argmin}} J(W,b)$ using an optimisation algorithm such as (the list below is not exhaustive):
* Gradient descent (with momentum) optimizer
* RMSprop optimizer 
* Adam optimizer
* Adadelta

The above algorithms are gradient-based algorithms. In order to implement these optimisation algorithms we need to compute the partial derivatives of $J(W,b)$ with respect to $W^{[1]},\dots,W^{[m]},b^{[1]},\dots,b^{[m]}$. The gradient descent update rule is, for $k=1,\dots,m$:
$$\begin{eqnarray}
W^{[k]} &=& W^{[k]}-\alpha \frac{\partial J(W,b)}{\partial W^{[k]}}\\
b^{[k]} &=&b^{[k]}-\alpha \frac{\partial J(W,b)}{\partial b^{[k]}}
\end{eqnarray}$$
where m is the number of layers and $\alpha$ is the learning rate. When we apply a gradient-based algorithm: 
* We have to initialize the weights and biases. We deal with weight initialization in a later section.
* We compute the partial derivatives with the chain rule. This leads to the backpropagation algorithm.
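
To make the update rule concrete, here is a minimal sketch of gradient descent on a network with no hidden layer, an identity output and the mean squared error, for which the gradients can be written by hand. The synthetic data, the learning rate `alpha = 0.1` and the number of steps are assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # 200 samples, q = 3 inputs
true_W, true_b = np.array([2.0, -1.0, 0.5]), 0.3
y = X @ true_W + true_b                  # targets generated by a known linear model

W, b = np.zeros(3), 0.0                  # initial parameters
alpha = 0.1                              # learning rate

for step in range(500):
    y_hat = X @ W + b                    # forward pass
    error = y_hat - y
    grad_W = 2.0 * X.T @ error / len(y)  # dC/dW for the mean squared error
    grad_b = 2.0 * error.mean()          # dC/db
    W = W - alpha * grad_W               # gradient descent update
    b = b - alpha * grad_b

print(W, b)  # close to true_W and true_b
```

For deeper networks the gradients are no longer written by hand; they are obtained with the chain rule, which is the topic of the next section.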

# Backpropagation algorithm: the heart of neural network training
Backpropagation refers to the method of calculating the gradient of neural network parameters. In short, the method traverses the network in reverse order, from the output to the input layer, according to the chain rule from calculus. The algorithm stores any intermediate variables (partial derivatives) required while calculating the gradient with respect to some parameters.<br><br>
Backpropagation is just a very computationally efficient approach to compute the derivatives of the objective function $J(W,b)$ and our goal is to use those derivatives to learn the weight coefficients for parameterizing a multi-layer artificial neural network.<br><br>
In other words, backpropagation calculates the gradient of the objective function $J(W,b)$ with respect to all the weights in the network. This gradient is then fed to a gradient descent algorithm, which in turn uses it to update the weights in order to minimize $J(W,b)$.<br><br> 
Here is an illustration of the inner workings of the backpropagation algorithm. The notations differ a bit from the ones used in this course.



<center>
<img src="https://raw.githubusercontent.com/fabnancyuhp/DEEP-LEARNING/main/IMAGE/DENSENN/diagram-of-backpropagation-training-algorithm-and-typical-neuron-model_W640.jpg" width="40%" />
</center>

* After propagating the input features forward to the output layer through the various hidden layers, which may use the same or different activation functions, we obtain a prediction (for a binary classification task, typically the predicted probability that a sample belongs to the positive class).

* Now, the backpropagation algorithm propagates backward from the output layer to the input layer, calculating the error gradients on the way.

* Once the gradients of the cost function w.r.t. each parameter (weights and biases) in the neural network have been computed, the algorithm takes a gradient descent step towards the minimum to update the value of each parameter in the network using these gradients.
* The number of epochs defines how many times the whole training set is passed through the network during the training stage.
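
The steps above can be sketched for a small network with one sigmoid hidden layer, an identity output layer and the mean squared error as loss. The architecture, the variable names (`dz2`, `da1`, ...) and the finite-difference check at the end are assumptions made for this example; the check only illustrates that the chain-rule gradient matches a numerical estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    W1, b1, W2, b2 = params
    a1 = sigmoid(W1 @ x + b1)   # hidden activations
    y_hat = W2 @ a1 + b2        # identity output layer
    return a1, y_hat

def loss(y_hat, y):
    return np.mean((y - y_hat) ** 2)

def backward(x, y, params):
    """Gradients of the loss w.r.t. all parameters, from output to input."""
    W1, b1, W2, b2 = params
    a1, y_hat = forward(x, params)
    dz2 = 2.0 * (y_hat - y) / y.size   # dL/dz2 (identity output)
    dW2, db2 = np.outer(dz2, a1), dz2
    da1 = W2.T @ dz2                   # propagate the error to the hidden layer
    dz1 = da1 * a1 * (1.0 - a1)        # sigmoid'(z1) = a1 * (1 - a1)
    dW1, db1 = np.outer(dz1, x), dz1
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=2)
params = [rng.normal(size=(4, 3)), np.zeros(4), rng.normal(size=(2, 4)), np.zeros(2)]
dW1, db1, dW2, db2 = backward(x, y, params)

# Finite-difference estimate of dL/dW1[0, 0] to check the chain-rule result
eps = 1e-6
plus = [p.copy() for p in params]
minus = [p.copy() for p in params]
plus[0][0, 0] += eps
minus[0][0, 0] -= eps
numeric = (loss(forward(x, plus)[1], y) - loss(forward(x, minus)[1], y)) / (2 * eps)
print(dW1[0, 0], numeric)  # the two values should agree closely
```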

# Variants of Gradient Descent: Batch, Stochastic, and Minibatch

A neural network is trained on a set of input data. There are three variants of gradient descent, which differ in how many training examples are used to compute each update: Batch, Stochastic, and Mini-batch.

* **Stochastic Gradient Descent**: Stochastic gradient descent, often abbreviated SGD, is a variation of the gradient descent algorithm that calculates the error and updates the model for each example in the training dataset. Because the model is updated after each training example, stochastic gradient descent is often called an online machine learning algorithm.<br>
Since only a single training example is considered before taking a step in the direction of the gradient, we are forced to loop over the training set and thus cannot exploit the speed associated with vectorizing the code. It makes very noisy updates to the parameters.
* **Batch Gradient Descent**: Batch gradient descent is a variation of the gradient descent algorithm that calculates the error for each example in the training dataset, but only updates the model after all training examples have been evaluated. One cycle through the entire training dataset is called a **training epoch**. Therefore, it is often said that batch gradient descent performs model updates at the end of each training epoch.<br>
Since the entire training set is considered before taking a step in the direction of the gradient, a single update takes a lot of time. It makes smooth updates to the model parameters.
* **Mini-Batch Gradient Descent**: Mini-batch gradient descent is a variation of the gradient descent algorithm that splits the training dataset into small batches that are used to calculate the model error and update the model coefficients. Implementations may choose to sum or average the gradient over the mini-batch, which reduces the variance of the gradient compared with a single example. Mini-batch gradient descent seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent. It is the most common implementation of gradient descent used in the field of deep learning.<br>
Since a subset of training examples is considered, it can make quick updates to the model parameters and can also exploit the speed associated with vectorizing the code. Depending on the batch size, the updates can be made less noisy: the greater the batch size, the less noisy the update.

In Stochastic Gradient Descent, you use only 1 training example before updating the parameters. When the training set is large, SGD can be faster. But the parameters will "oscillate" toward the minimum rather than converge smoothly. Here is an illustration of this (the illustration below comes from Andrew Ng's MOOC):
<center>
<img src="https://raw.githubusercontent.com/fabnancyuhp/DEEP-LEARNING/main/IMAGE/DENSENN/kiank_sgd.png" width="60%" />
</center>
In practice, we'll often get faster results if we use neither the whole training set nor only one training example to perform each update. Mini-batch gradient descent uses an intermediate number of examples for each step. With mini-batch gradient descent, we loop over the mini-batches instead of looping over individual training examples (the illustration below comes from Andrew Ng's MOOC).
<center>
<img src="https://raw.githubusercontent.com/fabnancyuhp/DEEP-LEARNING/main/IMAGE/DENSENN/kiank_minibatch.png" width="60%" />
</center>
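
The three variants differ only in the number of examples used per update, which the following sketch makes explicit. The helper name `iterate_minibatches` and the toy data are assumptions made for this example; with `batch_size=1` the loop corresponds to stochastic gradient descent, and with `batch_size=len(X)` to batch gradient descent.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs covering the data once (one epoch)."""
    indices = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        yield X[batch], y[batch]

rng = np.random.default_rng(0)
X, y = rng.normal(size=(10, 3)), rng.normal(size=10)

# One parameter update would be performed per yielded mini-batch
batch_sizes = [len(xb) for xb, _ in iterate_minibatches(X, y, 4, rng)]
print(batch_sizes)  # [4, 4, 2]: the last mini-batch holds the remaining examples
```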

## Momentum
Because mini-batch gradient descent makes a parameter update after seeing just a subset of examples, the direction of the update has some variance, and so the path taken by mini-batch gradient descent will "oscillate" toward convergence. Using momentum can reduce these oscillations.<br><br>
Momentum takes into account the past gradients to smooth out the update. We will store the 'direction' of the previous gradients in a variable. Formally, this will be the exponentially weighted average of the gradient on previous steps.<br><br>
**Adam** is one of the most effective optimization algorithms for training neural networks. It combines ideas from RMSProp and Momentum. As listed in the Forward propagation section, RMSProp is another optimization algorithm for training neural networks.
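
The momentum update, in which the step follows an exponentially weighted average of the past gradients, can be sketched as follows. The function name `momentum_step`, the hyperparameters `alpha` and `beta`, and the toy objective $f(w)=w^{2}$ are assumptions made for this illustration; other formulations of momentum (without the $1-\beta$ factor) are also common.

```python
import numpy as np

def momentum_step(param, grad, velocity, alpha=0.1, beta=0.9):
    # velocity: exponentially weighted average of the gradients seen so far
    velocity = beta * velocity + (1.0 - beta) * grad
    # the parameter moves along the smoothed direction, not the raw gradient
    param = param - alpha * velocity
    return param, velocity

# Toy example: minimize f(w) = w^2, whose gradient is 2w
w, v = np.array([5.0]), np.zeros(1)
for _ in range(500):
    w, v = momentum_step(w, 2.0 * w, v)
print(w)  # very close to the minimum at 0
```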

# Batch Normalization

**Normalization** is a data pre-processing tool used to bring the numerical data to a common scale without distorting its shape.<br><br>
Generally, when we input data to a machine learning or deep learning algorithm, we tend to rescale the values to a balanced scale. The reason we normalize is partly to ensure that our model can generalize appropriately.<br><br>
**Batch normalization** is a process to make neural networks faster and more stable by adding extra layers to a deep neural network. The new layer performs the standardizing and normalizing operations on the input of a layer coming from a previous layer.<br><br>
But what is the reason behind the term “Batch” in batch normalization? A typical neural network is trained on batches of input data. Similarly, the normalizing process in batch normalization computes its statistics over a batch, not over a single input.<br><br>
**How does Batch Normalization work?**
* **Normalization of the Input**
  * Normalization is the process of transforming the data to have mean zero and standard deviation one. In this step we have the hidden activations of layer h for a batch of $B$ inputs; $a^{[h](i)}_{j}$ denotes the activation of neuron j for the ith input of the batch. First, we calculate the mean of each hidden activation over the batch.
$$\mu_{j} = \frac{1}{B}\sum_{i=1}^{B}a^{[h](i)}_{j}$$
We use the notations introduced in the Forward propagation section. Here, $j=1,\dots,n^{[h]}$, where $n^{[h]}$ is the number of neurons at layer h. 
  * Once we have the mean, the next step is to calculate the standard deviation of each hidden activation over the batch.
  $$\sigma_{j}=\left[\frac{1}{B}\sum_{i=1}^{B}(a^{[h](i)}_{j}-\mu_{j})^{2} \right]^{\frac{1}{2}} $$
  With the mean and the standard deviation ready, we normalize the hidden activations: we subtract the mean from each activation and divide the result by the sum of the standard deviation and the smoothing term ($\varepsilon$).<br><br>
  The smoothing term ($\varepsilon$) ensures numerical stability by preventing a division by zero.
  $$a^{[h](i)}_{j(norm)}=\frac{a^{[h](i)}_{j}-\mu_{j}}{\sigma_{j}+\varepsilon} $$
* **Rescaling and Offsetting:** In the final operation, the normalized values are re-scaled and offset. Here two components of the BN algorithm come into the picture, $\gamma$ and $\beta$. These parameters are used for re-scaling ($\gamma$) and shifting ($\beta$) the vector containing the values from the previous operation.
$$a^{[h](i)}_{j}=\gamma_{j} a^{[h](i)}_{j(norm)}+\beta_{j} $$
These two are learnable parameters: during training, the network learns the values of $\gamma$ and $\beta$ along with the weights, so that the next layer receives inputs at a suitable scale.

**One of the main advantages of Batch Normalization is to Speed Up the Training.**
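
The two steps (normalization, then rescaling and offsetting) can be sketched in NumPy for a batch of hidden activations. This is a simplified training-time illustration under stated assumptions: the statistics are computed per neuron over the batch (axis 0), the array shapes and the value of `eps` are choices made here, and the moving averages that a real layer keeps for inference are left out.

```python
import numpy as np

def batch_norm(a, gamma, beta, eps=1e-5):
    """a has shape (batch_size, n_neurons); gamma and beta have shape (n_neurons,)."""
    mu = a.mean(axis=0)                # mean of each neuron over the batch
    sigma = a.std(axis=0)              # standard deviation over the batch
    a_norm = (a - mu) / (sigma + eps)  # normalization with the smoothing term
    return gamma * a_norm + beta       # rescaling and offsetting

rng = np.random.default_rng(0)
a = rng.normal(loc=5.0, scale=3.0, size=(32, 4))  # 32 samples, 4 neurons
out = batch_norm(a, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0))  # approximately 0 for every neuron
print(out.std(axis=0))   # approximately 1 for every neuron
```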

Here is an example of batch normalization implemented in Keras:
``` python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Flatten

model = Sequential()
# Flatten each 28x28 input into a vector of size 784
model.add(Flatten(input_shape=[28, 28]))
# Normalize the inputs, then the output of each hidden layer
model.add(BatchNormalization())
model.add(Dense(300,activation='relu'))
model.add(BatchNormalization())
model.add(Dense(100,activation='relu'))
model.add(BatchNormalization())
# Softmax output layer for 10 classes
model.add(Dense(10,activation='softmax'))
model.summary()
``` 

# Weight Initialization for Deep Learning Neural Networks/ ANN

Weight initialization is used to define the initial values for the parameters in neural network models prior to training the models on a dataset.<br><br>
The nodes in neural networks are composed of parameters referred to as weights used to calculate a weighted sum of the inputs.<br><br>
Neural network models are fit using an optimization algorithm that incrementally changes the network weights to minimize a loss function. This optimization algorithm requires a starting point in the space of possible weight values from which to begin the optimization process.<br><br>
There are several weight initialization techniques:
* The current standard approach for initialization of the weights of neural network layers and nodes that use the Sigmoid or TanH activation function is called **“glorot” or “xavier” initialization**. 
* The current standard approach for initialization of the weights of neural network layers and nodes that use the rectified linear (ReLU) activation function is called **“he” initialization.**
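
To show what these two schemes compute, here is a minimal NumPy sketch of Glorot uniform and He normal initialization. The function names and the `(fan_out, fan_in)` weight shape (chosen to match the $W^{[i]}$ convention of this course) are assumptions of this example. The Keras `HeNormal` initializer draws from a truncated normal distribution; the plain normal used below is a simplification.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    # uniform on [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out))
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def he_normal(fan_in, fan_out, rng):
    # zero-mean normal with standard deviation sqrt(2 / fan_in)
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W_sigmoid = glorot_uniform(fan_in=300, fan_out=200, rng=rng)  # for a sigmoid/tanh layer
W_relu = he_normal(fan_in=300, fan_out=200, rng=rng)          # for a ReLU layer
print(W_sigmoid.shape, np.abs(W_sigmoid).max())
print(W_relu.shape, W_relu.std())
```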




Keras has a wide range of weight initialization methods:
 * tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.05, seed=None) : Initializer that generates tensors with a normal distribution.
 * tf.keras.initializers.RandomUniform(minval=-0.05, maxval=0.05, seed=None) : Initializer that generates tensors with a uniform distribution.
 * tf.keras.initializers.GlorotNormal(seed=None) : The Glorot normal initializer, also called Xavier normal initializer.
 * tf.keras.initializers.GlorotUniform(seed=None) : The Glorot uniform initializer, also called Xavier uniform initializer.
 * tf.keras.initializers.HeNormal(seed=None) : He normal initializer.
 * tf.keras.initializers.HeUniform(seed=None) : He uniform variance scaling initializer.


Below is an example of a dense layer with 30 neurons that uses the sigmoid activation function and the glorot_uniform weight initialization:
``` python
from tensorflow.keras import layers
from tensorflow.keras import initializers

# The initializer can be given by name...
layer = layers.Dense(units=30,activation='sigmoid',kernel_initializer='glorot_uniform')
# ...or as an initializer object, which allows setting arguments such as the seed
layerbis = layers.Dense(units=30,activation= 'sigmoid',kernel_initializer=initializers.GlorotUniform(seed=None))
``` 

Below is an example of a dense layer with 21 neurons using the He uniform initialization, followed by a dense layer with 15 neurons using the Glorot uniform initializer.
``` python
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Dense(units=21,input_dim=10,activation='relu',kernel_initializer='he_uniform'))
model.add(Dense(units=15,activation='sigmoid',kernel_initializer='glorot_uniform'))
```

# Regularization

There are several ways of controlling the capacity of Neural Networks to prevent overfitting: **L2 regularization, L1 regularization, Dropout**.
 
## L2 & L1 regularization
L1 and L2 are the most common types of regularization. These update the general cost function by adding another term known as the regularization term.
<center>
Cost function = Loss + Regularization term
</center>
Due to the addition of this regularization term, the values of the weight matrices decrease during training. The underlying assumption is that a neural network with smaller weights leads to a simpler model. Therefore, it also reduces overfitting to quite an extent.<br><br>
However, this regularization term differs in L1 and L2.<br>
In L2, we have:
$$Cost = Loss+\frac{\lambda}{2m_{entries}}\times \sum \lvert w \rvert^{2} $$
Here, $\lambda$ is the regularization parameter, a hyperparameter whose value is tuned for better results, $m_{entries}$ is the number of weights (as in the Forward propagation section), and the sum runs over all the weights $w$ of the network. L2 regularization is also known as weight decay as it forces the weights to decay towards zero (but not exactly zero).<br><br>
In L1, we have:
$$Cost = Loss+\frac{\lambda}{2m_{entries}}\times \sum \lvert w \rvert $$

In this case, we penalize the absolute value of the weights. Unlike L2, the weights may be reduced to exactly zero here. Hence, it is very useful when we are trying to compress our model. Otherwise, we usually prefer L2 over it.<br>
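
The two penalty terms can be computed directly from the weight matrices. The sketch below follows the scaling $\frac{\lambda}{2m_{entries}}$ used in the Forward propagation section; the function names and the toy weight matrix are assumptions made for this example.

```python
import numpy as np

def l2_penalty(weights, lam):
    # lam / (2 * m_entries) * sum of the squared weights
    m_entries = sum(W.size for W in weights)
    return lam / (2 * m_entries) * sum(np.sum(W ** 2) for W in weights)

def l1_penalty(weights, lam):
    # lam / (2 * m_entries) * sum of the absolute values of the weights
    m_entries = sum(W.size for W in weights)
    return lam / (2 * m_entries) * sum(np.sum(np.abs(W)) for W in weights)

weights = [np.array([[1.0, -2.0], [0.0, 3.0]])]  # a single 2 x 2 weight matrix
print(l2_penalty(weights, lam=0.4))  # about 0.7 = 0.4 / 8 * (1 + 4 + 0 + 9)
print(l1_penalty(weights, lam=0.4))  # about 0.3 = 0.4 / 8 * (1 + 2 + 0 + 3)
```

The regularized cost is then the unregularized loss plus one of these terms.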

Below is the sample code to apply L2 regularization to a Dense layer.

``` python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import regularizers

model = Sequential()
model.add(Dense(64, input_dim=64,kernel_regularizer=regularizers.l2(0.01)))
```
## Dropout

When a fully-connected layer has a large number of neurons, co-adaptation is more likely to happen. Co-adaptation refers to multiple neurons in a layer extracting the same, or very similar, hidden features from the input data. This can happen when the connection weights for two different neurons are nearly identical.<br>
This poses two different problems:
* The machine's resources are wasted on computing the same output several times.
* If many neurons extract the same features, those features gain more significance in our model. This leads to overfitting if the duplicated features are specific to the training set.

We use dropout while training the NN to minimize co-adaptation.
In dropout, we randomly shut down some fraction of a layer’s neurons at each training step by zeroing out the neuron values. The fraction of neurons to be zeroed out is known as the dropout rate $r_{d}$. The remaining neurons have their values multiplied by $\frac{1}{1 - r_{d}}$ so that the expected sum of the neuron values remains the same.
The two images below represent a 3-layer ANN without and with dropout.
<center>
<img src="https://raw.githubusercontent.com/fabnancyuhp/DEEP-LEARNING/main/IMAGE/DENSENN/A-Dense-Fully-Connected-Neural-Network-with-and-without-Dropout.png" width="50%" />
</center>
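
The dropout operation described above (zeroing a random fraction $r_{d}$ of the values and scaling the others by $\frac{1}{1-r_{d}}$) can be sketched in NumPy. The function name, the `training` flag and the toy activations are assumptions made for this illustration.

```python
import numpy as np

def dropout(a, rate, rng, training=True):
    """Zero out roughly a fraction `rate` of the values and rescale the rest."""
    if not training or rate == 0.0:
        return a  # dropout is only active during training
    keep_mask = rng.random(a.shape) >= rate  # True for the neurons that are kept
    return a * keep_mask / (1.0 - rate)      # scale so the expected sum is unchanged

rng = np.random.default_rng(0)
a = np.ones(10000)                 # toy activations, all equal to 1
dropped = dropout(a, rate=0.2, rng=rng)
print(np.mean(dropped == 0.0))     # close to the dropout rate 0.2
print(dropped.mean())              # close to 1, the mean before dropout
```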

## Data augmentation

**Data augmentation** is the process of creating new data samples by manipulating the original data. The simplest way to reduce overfitting is to increase the size of the training data, but collecting more labeled data is often too costly. Data augmentation increases the size of the training set without new labeling effort.<br>

**IMAGES**<br>
Images are a great way to illustrate data augmentation. In order to train an algorithm to recognize an image of a dog, you need a training dataset that contains different images of dogs. If too many of your images look the same, you can enrich your training dataset with data augmentation to avoid overfitting.
<center>
<img src="https://raw.githubusercontent.com/fabnancyuhp/DEEP-LEARNING/main/IMAGE/DENSENN/imageaugmentation.png" width="35%" />
</center>
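
As a rough illustration, two simple image augmentations (a random horizontal flip and a small amount of additive noise) can be written in NumPy. The function name `augment`, the noise level and the assumption that pixel values lie in $[0,1]$ are choices made for this sketch; in practice, dedicated preprocessing utilities are normally used.

```python
import numpy as np

def augment(image, rng, noise_std=0.02):
    """Return a randomly flipped, slightly noisy copy of an image with values in [0, 1]."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1]  # horizontal flip: reverse the width axis
    out = out + rng.normal(0.0, noise_std, size=out.shape)  # additive Gaussian noise
    return np.clip(out, 0.0, 1.0)  # keep the pixel values in the valid range

rng = np.random.default_rng(0)
image = rng.random((28, 28))  # stand-in for a real grayscale image
augmented = [augment(image, rng) for _ in range(5)]  # five new training samples
print(augmented[0].shape)  # (28, 28)
```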

**TEXT**: For textual data it is also possible to apply suitable augmentations: one can replace words with synonyms, manipulate punctuation, etc.<br><br>
**AUDIO DATA**:
Depending on your objective, some of the transformations you can apply are adding noise, applying equalizer transformations, cut-offs, etc.<br><br>




# Exploding Gradients and Vanishing Gradients

We know that the backpropagation algorithm is the heart of neural network training. Vanishing gradients and exploding gradients occur during the backpropagation algorithm. Now, we can define them:<br>
* **Vanishing Gradients:** As the backpropagation algorithm advances backward from the output layer towards the input layer, the gradients often get smaller and smaller and approach zero, which eventually leaves the weights of the initial (lower) layers nearly unchanged. As a result, gradient descent may never converge to a good solution. This is known as the vanishing gradients problem.
* **Exploding Gradients:** On the contrary, in some cases, the gradients keep getting larger and larger as the backpropagation algorithm progresses. This, in turn, causes very large weight updates and causes gradient descent to diverge. This is known as the exploding gradients problem. 

Here, we list some techniques to fix the vanishing and exploding gradients problems:
  
* **Using Non-saturating Activation Functions**: the sigmoid activation function saturates for large inputs (negative or positive), which turned out to be a major reason behind vanishing gradients, so it is not recommended in the hidden layers of the network. To tackle the saturation of activation functions like sigmoid and tanh, we can use non-saturating functions like ReLU and its variants.
* Using He initialization along with any variant of the ReLU activation function can significantly reduce the chances of vanishing/exploding gradients at the beginning of training. However, it does not guarantee that the problem won’t reappear during training.
* Use Batch Normalization layers.
* Use weight regularization like $L1$ and $L2$ regularization.
* Re-design the network model: in deep neural networks, exploding gradients may be addressed by redesigning the network to have fewer layers. There may also be some benefit in using a smaller batch size while training the network.
* **Gradient Clipping** : Another popular technique to mitigate the exploding gradients problem is to clip the gradients during backpropagation so that they never exceed some threshold. This is called Gradient Clipping.
   * The Keras optimizer below clips every component of the gradient vector to a value between -1.0 and 1.0.
   * This means that all the partial derivatives of the loss w.r.t. each trainable parameter are clipped between -1.0 and 1.0.
   ``` python
from tensorflow.keras.optimizers import SGD 
optimizer = SGD(clipvalue = 1.0)
   ```
   * To ensure that the orientation remains intact even after clipping, we should clip by norm rather than by value.
   ``` python
from tensorflow.keras.optimizers import SGD 
optimizer = SGD(clipnorm = 1.0)
   ```
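
The difference between the two clipping modes can be seen on a small gradient vector. The NumPy functions below are a sketch written for this illustration, not the Keras implementation; the names `clip_by_value` and `clip_by_norm` and the example gradient are assumptions.

```python
import numpy as np

def clip_by_value(grad, clip_value):
    # each component is clipped independently to [-clip_value, clip_value]
    return np.clip(grad, -clip_value, clip_value)

def clip_by_norm(grad, clip_norm):
    # the whole vector is rescaled if its L2 norm exceeds clip_norm
    norm = np.linalg.norm(grad)
    if norm > clip_norm:
        return grad * (clip_norm / norm)
    return grad

grad = np.array([0.9, 100.0])
by_value = clip_by_value(grad, 1.0)  # components 0.9 and 1.0: the direction changes
by_norm = clip_by_norm(grad, 1.0)    # same direction as grad, with norm 1
print(by_value, by_norm)
```

Clipping by value turns a gradient that pointed almost entirely along the second axis into one that points roughly along the diagonal, whereas clipping by norm only shortens the vector.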

Here, we apply batch normalization, dropout, regularization, and weight initialization together.

``` python
from tensorflow.keras.layers import Dense, BatchNormalization, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.regularizers import l2

model = Sequential()
# Sigmoid hidden layer: Glorot initialization and an L2 penalty on the weights
model.add(Dense(100,activation='sigmoid',kernel_initializer='glorot_normal',kernel_regularizer=l2(0.01),input_shape=(10,)))
model.add(BatchNormalization())
model.add(Dropout(0.2))
# ReLU hidden layer: He initialization and an L2 penalty on the weights
model.add(Dense(100,activation='relu',kernel_initializer='he_normal',kernel_regularizer=l2(0.01)))
model.add(BatchNormalization())
model.add(Dropout(0.2))
# Single sigmoid output for binary classification
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
   ```