## Regularizers:

### TL;DR:

You have the regression equation $y = Wx+b$, where $x$ is the input, $W$ is the weight matrix and $b$ is the bias.

 - Kernel Regularizer: Tries to reduce the weights $W$ (excluding bias).
 - Bias Regularizer: Tries to reduce the bias $b$.
 - Activity Regularizer: Tries to reduce the layer's output $y$, so it shrinks the
   weights and adjusts the bias together to make $Wx+b$ as small as possible.

Usually, if you have no prior knowledge of the distribution you wish to model, you would use only the Kernel Regularizer, since a large enough network can still model your function even when the regularization on the weights is strong.

If you want the output function to pass through (or have an intercept closer to) the origin, you can use the Bias Regularizer.  
If you want the output to be smaller (or closer to 0), you can use the Activity Regularizer.
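
To make the three targets concrete, here is a minimal plain-Python sketch that computes an L2 penalty on the weights, the bias and the output of a tiny layer $y = Wx+b$. The numbers in `W`, `b` and `x`, the `factor` of 0.01 and the helper names `dense` and `l2_penalty` are all made up for illustration. It only shows which quantity each regularizer looks at, not how Keras implements them internally.

```python
def dense(W, x, b):
    """Compute y = Wx + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def l2_penalty(values, factor):
    """Return factor * (sum of squared values)."""
    return factor * sum(v * v for v in values)


# Hypothetical layer parameters and input, chosen only for illustration.
W = [[0.5, -1.0], [2.0, 0.0]]
b = [0.1, -0.2]
x = [1.0, 3.0]
factor = 0.01  # assumed regularization strength

y = dense(W, x, b)

# Kernel regularizer: penalizes the entries of W only (bias excluded).
kernel_penalty = l2_penalty([w for row in W for w in row], factor)
# Bias regularizer: penalizes the entries of b only.
bias_penalty = l2_penalty(b, factor)
# Activity regularizer: penalizes the layer output y = Wx + b.
activity_penalty = l2_penalty(y, factor)

print(kernel_penalty, bias_penalty, activity_penalty)
```

Each penalty is added to the training loss, so the optimizer is pushed to shrink whichever quantity the chosen regularizer measures.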


Now for the $L1$ versus $L2$ loss for **weight decay** (not to be confused with the loss function applied to the output).  
$L2$ loss is defined as $w^2$  
$L1$ loss is defined as $|w|$.  
$w$ is a component of the matrix $W$.

The gradient of $L2$ will be: $2w$  
The gradient of $L1$ will be: $sign(w)$

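Thus, for each gradient update with a learning rate $a$ (and ignoring the regularization factor), $L2$ loss subtracts $2aW$ from the weights, while $L1$ loss subtracts $a \cdot sign(W)$. The $L2$ step is proportional to the size of each weight, whereas the $L1$ step has the same magnitude regardless of how large the weight is.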

The effect of $L2$ loss on the weights is a reduction of the large components in the matrix $W$, while $L1$ loss makes the weight matrix sparse, with many zero values. The same applies to the bias and the output when using the bias and activity regularizers, respectively.
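
The difference between the two update rules can be seen in a small plain-Python sketch that applies only the penalty gradient (no data loss) to a few weights. The starting weights, the learning rate and the number of steps are arbitrary choices for illustration. The clamp to zero in `l1_step` is an added assumption so that a weight does not jump back and forth across zero; it is not part of the formulas above.

```python
def sign(w):
    """Return -1, 0 or 1 depending on the sign of w."""
    return (w > 0) - (w < 0)


def l2_step(w, lr):
    """One update using the L2 gradient 2w: the step scales with w."""
    return w - lr * 2 * w


def l1_step(w, lr):
    """One update using the L1 gradient sign(w): a constant-size step.

    Assumption for this sketch: if the step would cross zero,
    the weight is set to exactly zero instead.
    """
    step = lr * sign(w)
    if abs(w) <= abs(step):
        return 0.0
    return w - step


lr = 0.1                      # arbitrary learning rate
start = [1.0, 0.05, -0.3]     # arbitrary starting weights
l2_weights = list(start)
l1_weights = list(start)

for _ in range(5):
    l2_weights = [l2_step(w, lr) for w in l2_weights]
    l1_weights = [l1_step(w, lr) for w in l1_weights]

print("L2:", l2_weights)  # every weight shrinks, none is exactly zero
print("L1:", l1_weights)  # the small weights have become exactly zero
```

Under L2 every weight is multiplied by the same factor at each step, so large weights lose the most but nothing reaches zero. Under L1 every weight loses the same fixed amount, so the small weights hit zero first, which is where the sparsity comes from.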

#### NOTE 1:
- If you apply L2 weight decay, the network will try to be less sensitive to small changes in its inputs.
- If you apply L1 weight decay, it goes further: the network will try to ignore some inputs altogether.

#### NOTE 2:

- `kernel_regularizer` acts on the weights, `bias_regularizer` acts on the bias, and `activity_regularizer` acts on the layer output $y$.

- We apply `kernel_regularizer` to penalize very large weights, which can cause the network to overfit. With it applied, the weights become smaller.

- We apply `bias_regularizer` to penalize the bias so that it moves towards zero.

- `activity_regularizer` tries to make the layer output smaller in order to reduce overfitting.
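
As a minimal sketch of how these three arguments might be set on a Keras `Dense` layer, assuming TensorFlow's bundled Keras is installed: the layer sizes and the regularization factors below are arbitrary placeholders, not recommended values.

```python
from tensorflow.keras import layers, regularizers

# A Dense layer with all three regularizers attached.
# The factors (0.01, 0.001) are placeholders chosen for illustration.
layer = layers.Dense(
    units=8,
    activation="relu",
    kernel_regularizer=regularizers.l2(0.01),     # penalizes the weights W
    bias_regularizer=regularizers.l2(0.01),       # penalizes the bias b
    activity_regularizer=regularizers.l1(0.001),  # penalizes the layer output
)

# Elastic Net style regularization combines the L1 and L2 penalties.
elastic_layer = layers.Dense(
    units=8,
    kernel_regularizer=regularizers.l1_l2(l1=0.001, l2=0.01),
)
```

In most cases only `kernel_regularizer` is needed. The other two can be added when you specifically want a smaller bias or smaller outputs.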




## Explanation about L1, L2 and L1_L2 Regularizers:


<p>
  <img src="assets/1.png" />
</p>

<p>
  <img src="assets/2.png" />
</p>

<p>
  <img src="assets/3.png" />
</p>

<p>
  <img src="assets/4.png" />
</p>

<p>
  <img src="assets/5.png" />
</p>

<p>
  <img src="assets/l1_component.png" />
</p>

<p>
  <img src="assets/5_1.png" />
</p>

<p>
  <img src="assets/l1_deriv.png" />
</p>

<p>
  <img src="assets/5_2.png" />
</p>

<p>
  <img src="assets/6.png" />
</p>

<p>
  <img src="assets/7.png" />
</p>

<p>
  <img src="assets/l2_comp.png" />
</p>

<p>
  <img src="assets/8.png" />
</p>

<p>
  <img src="assets/l1_deriv.png" />
</p>

<p>
  <img src="assets/9.png" />
</p>

<p>
  <img src="assets/l2_deriv.png" />
</p>

<p>
  <img src="assets/10.png" />
</p>

<p>
  <img src="assets/11.png" />
</p>

<p>
  <img src="assets/12.png" />
</p>

<p>
  <img src="assets/elastic_net.png" />
</p>

<p>
  <img src="assets/13.png" />
</p>

[REFERENCES: Many thanks to Christian for his comprehensive article](https://www.machinecurve.com/index.php/2020/01/21/what-are-l1-l2-and-elastic-net-regularization-in-neural-networks/)

[How to use L1, L2 and Elastic Net Regularization with Keras?](https://www.machinecurve.com/index.php/2020/01/23/how-to-use-l1-l2-and-elastic-net-regularization-with-keras/)