## Regularizers:

### TL;DR:

You have the regression equation $y = Wx+b$, where $x$ is the input, $W$ the weights matrix and $b$ the bias.

 - Kernel Regularizer: Tries to reduce the weights $W$ (excluding bias).
 - Bias Regularizer: Tries to reduce the bias $b$.
 - Activity Regularizer: Tries to reduce the layer's output $y$, so it shrinks the
   weights and adjusts the bias together to make $Wx+b$ as small as possible.

Usually, if you have no prior on the distribution that you wish to model, you would only use the Kernel Regularizer, since a large enough network can still model your function even when the regularization on the weights is strong.

If you want the output function to pass through (or have an intercept closer to) the origin, you can use the Bias Regularizer.  
If you want the output to be smaller (or closer to 0), you can use the Activity Regularizer.
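
To make the distinction concrete, here is a minimal NumPy sketch of what each regularizer penalizes for $y = Wx+b$. The shapes, the squared ($L2$-style) penalty and the strength `lam` are assumptions chosen for illustration, not values from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)        # input vector
W = rng.normal(size=(3, 4))   # weights matrix
b = rng.normal(size=3)        # bias vector
y = W @ x + b                 # layer output

lam = 0.01  # hypothetical regularization strength

kernel_penalty = lam * np.sum(W ** 2)    # depends only on the weights W
bias_penalty = lam * np.sum(b ** 2)      # depends only on the bias b
activity_penalty = lam * np.sum(y ** 2)  # depends on W, b and the input x

print(kernel_penalty, bias_penalty, activity_penalty)
```

Each penalty is added to the training loss, so the optimizer is pushed to shrink whichever quantity the penalty is computed from.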


Now for the $L1$ versus $L2$ loss for **weight decay** (not to be confused with the loss function on the outputs).  
$L2$ loss is defined as $w^2$  
$L1$ loss is defined as $|w|$.  
$w$ is a component of the matrix $W$.

The gradient of $L2$ will be: $2w$  
The gradient of $L1$ will be: $\operatorname{sign}(w)$

Thus, for each gradient update with a learning rate $a$, $L2$ loss subtracts $2aw$ from each weight (the factor of 2 is often absorbed into the learning rate or the regularization strength), while $L1$ loss subtracts $a \cdot \operatorname{sign}(w)$.

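The effect of $L2$ loss on the weights is a reduction of large components in the matrix $W$, because its update is proportional to the weight itself. $L1$ loss takes a step of constant size regardless of the weight's magnitude, so it makes the weights matrix sparse, with many zero values. The same applies to the bias and the output when using the bias and activity regularizers, respectively.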
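
The difference can be sketched by applying only the penalty updates to a few made-up weights. This is an illustration under assumptions: the weights, learning rate and step count are arbitrary, and the $L1$ step is clipped so that it cannot cross zero (a soft-thresholding step, swapped in here to avoid oscillation around zero), which a plain gradient step does not do.

```python
import numpy as np

a = 0.1      # learning rate (arbitrary)
steps = 10

w_l2 = np.array([2.0, 0.5, -0.1])  # made-up starting weights
w_l1 = w_l2.copy()

for _ in range(steps):
    # L2: the gradient of w^2 is 2w, so every weight shrinks by the same factor.
    w_l2 = w_l2 - a * 2 * w_l2
    # L1: the gradient of |w| is sign(w), so every weight moves a fixed step
    # toward zero. Assumption: the step is clipped at zero (soft-thresholding).
    w_l1 = np.sign(w_l1) * np.maximum(np.abs(w_l1) - a, 0.0)

print("L2:", w_l2)
print("L1:", w_l1)
```

Under $L2$ all three weights are scaled by the same factor and none becomes exactly zero. Under $L1$ the two smaller weights reach zero, while the largest only drops from 2.0 to about 1.0.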

#### NOTE 1:
- If you apply L2 weight decay, the network will try to be less sensitive to small changes in its inputs.
- If you apply L1 weight decay, it goes further: the network will try to ignore some inputs altogether.

#### NOTE 2:

- `kernel_regularizer` acts on the weights, `bias_regularizer` acts on the bias, and `activity_regularizer` acts on $y$ (the layer output).

- We apply `kernel_regularizer` to penalize very large weights, which can cause the network to overfit; after applying `kernel_regularizer` the weights become smaller.

- We apply `bias_regularizer` to add a penalty on the bias so that it moves toward zero.

- `activity_regularizer` tries to make the output smaller so as to reduce overfitting.
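
As a rough sketch of how these three arguments might be passed to a Keras `Dense` layer, assuming the Keras bundled with TensorFlow 2.x. The layer size and the penalty strengths are placeholder values, not recommendations.

```python
import tensorflow as tf

layer = tf.keras.layers.Dense(
    units=64,              # placeholder layer size
    activation="relu",
    # Penalizes the weights W (L2: sum of squares).
    kernel_regularizer=tf.keras.regularizers.L2(l2=1e-4),
    # Penalizes the bias b.
    bias_regularizer=tf.keras.regularizers.L2(l2=1e-4),
    # Penalizes the layer output y (L1: sum of absolute values).
    activity_regularizer=tf.keras.regularizers.L1(l1=1e-5),
)
```

Swapping in `tf.keras.regularizers.L1L2(l1=..., l2=...)` would apply both penalties at once.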




## Explanation about L1, L2 and L1_L2 Regularizers:


<p>
  <img  src=assets/1.png/ >
</p>

<p>
  <img  src=assets/2.png/>
</p>

<p>
  <img  src=assets/3.png/>
</p>

<p>
  <img  src=assets/4.png/>
</p>

<p>
  <img  src=assets/5.png/>
</p>

<p>
  <img  src=assets/l1_component.png/>
</p>

<p>
  <img  src=assets/5_1.png/>
</p>

<p>
  <img  src=assets/l1_deriv.png/>
</p>

<p>
  <img  src=assets/5_2.png/>
</p>

<p>
  <img  src=assets/6.png/>
</p>

<p>
  <img  src=assets/7.png/>
</p>

<p>
  <img  src=assets/l2_comp.png/>
</p>

<p>
  <img  src=assets/8.png/>
</p>

<p>
  <img  src=assets/l1_deriv.png/>
</p>

<p>
  <img  src=assets/9.png/>
</p>

<p>
  <img  src=assets/l2_deriv.png/>
</p>

<p>
  <img  src=assets/10.png/>
</p>

<p>
  <img  src=assets/11.png/>
</p>

<p>
  <img  src=assets/12.png/>
</p>

<p>
  <img  src=assets/elastic_net.png/>
</p>

<p>
  <img  src=assets/13.png/>
</p>

[REFERENCES: Many thanks to Christian for his comprehensive article](https://www.machinecurve.com/index.php/2020/01/21/what-are-l1-l2-and-elastic-net-regularization-in-neural-networks/)

[How to use L1, L2 and Elastic Net Regularization with Keras?](https://www.machinecurve.com/index.php/2020/01/23/how-to-use-l1-l2-and-elastic-net-regularization-with-keras/)

### Differences between L1 and L2 as Loss Function and Regularization:


<p>
  <img  src=assets/L1_L2_loss.png/>
</p>

<p>
  <img  src=assets/L1-vs-L2-properties-loss-function.png/>
</p>

Robustness, per Wikipedia, is explained as:

`The method of least absolute deviations finds applications in many areas, due to its robustness compared to the least squares method. Least absolute deviations is robust in that it is resistant to outliers in the data. This may be helpful in studies where outliers may be safely and effectively ignored. If it is important to pay attention to any and all outliers, the method of least squares is a better choice.`

Intuitively speaking, since an L2-norm squares the error (increasing it by a lot if the error > 1), the model sees a much larger error ($e^2$ vs $e$) than with the L1-norm, so it is much more sensitive to that example and adjusts itself to minimize that error. If the example is an outlier, the model will be adjusted to fit this single outlier at the expense of many other common examples, since the errors of those common examples are small compared to that of the outlier.
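
A small illustration of this effect, assuming the simplest possible model: a single constant prediction. For a constant, the squared-error (L2) loss is minimized by the mean and the absolute-error (L1) loss by the median. The data below is made up, with one obvious outlier.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # made-up sample; 100 is the outlier

l2_fit = np.mean(data)    # constant that minimizes the sum of squared errors
l1_fit = np.median(data)  # constant that minimizes the sum of absolute errors

print("L2 fit:", l2_fit)  # pulled far toward the outlier
print("L1 fit:", l1_fit)  # stays with the bulk of the data
```

The L2 fit lands at 22, far from every typical point, while the L1 fit stays at 3.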


<p>
  <img  src=assets/l1_l2_reg.png/>
</p>

<p>
  <img  src=assets/L1-vs-L2-properties-regularization.png/>
</p>


### Dropouts:

Essentially, Dropout acts as a regularizer: it makes the network less prone to overfitting.

As we already know, the deeper the network is, the more parameters it has. For example, VGGNet, from the ImageNet competition 2014, has roughly 140 million parameters. That’s a lot. With that many parameters, the network could easily overfit, especially with a small dataset.

Enter Dropout.

In the training phase, with Dropout, at each hidden layer we keep each neuron with probability p and kill it otherwise. ‘Kill’ here means setting the neuron’s output to 0. As a neural net is a collection of multiplicative operations, those 0 neurons won’t propagate anything to the rest of the network.

<p>
  <img  src=assets/dropout.png/>
</p>

Let n be the number of neurons in a hidden layer; then the expected number of active neurons at each Dropout step is `p*n`, as we sample each neuron independently with probability p. Concretely, if we have 1024 neurons in a hidden layer and set `p = 0.5`, we can expect that only half of the neurons (512) would be active at any given time.
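
The `p*n` expectation can be checked with a quick simulation. This is only a sketch: the number of trials and the random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1024, 0.5   # neurons in the layer, probability of keeping a neuron
trials = 1000      # arbitrary number of simulated Dropout masks

# One row per trial; each entry is 1 (neuron active) or 0 (neuron killed).
masks = rng.binomial(1, p, size=(trials, n))
active_per_trial = masks.sum(axis=1)
mean_active = active_per_trial.mean()

print(mean_active)  # should be close to p * n = 512
```

Individual masks vary around 512, but the average over many masks settles near `p*n`.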

Because we force the network to train with only a random `p*n` of the neurons, we intuitively force it to learn the data with different subsets of neurons. The only way the network can perform well is to adapt to that constraint and learn a more general representation of the data.

It’s easy to memorize things when the network has a lot of parameters (overfit), but it’s hard to memorize things when the network effectively has only so many parameters to work with. Hence, the network must learn to generalize more to get the same performance as memorizing.

So, that’s why Dropout will increase test-time performance: it improves generalization and reduces the risk of overfitting.

Let’s see the concrete code for Dropout:

#### Dropout training
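```python
import numpy as np
# h1: hidden-layer activations; p: probability of keeping a neuron
u1 = np.random.binomial(1, p, size=h1.shape)  # 1 = keep, 0 = kill
h1 *= u1
```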

First, we sample an array of independent Bernoulli random variables, which is just a collection of zeros and ones indicating whether we keep the neuron (1) or kill it (0). For example, the value of u1 could be `np.array([1, 0, 0, 1, 1, 0, 1, 0])`. Then, if we multiply our hidden layer by this array, we get the original value of the neuron where the array element is 1, and 0 where the array element is 0.

For example, after Dropout, we need to do `h2 = np.dot(h1, W2)`, which is a multiplication operation. What is zero times x? It’s zero. Then the subsequent multiplications would also be zero. That’s why those 0 neurons won’t contribute anything to the rest of the propagation.

Now, because we’re only using `p*n` of the neurons, the output has an expectation of `p*x`, where x is the expected output when we use all the neurons (without Dropout).

As we don’t use Dropout at test time, the expected output of the layer is then x. That doesn’t match the training phase. To make the test-time output match the training-phase expectation, we scale the layer output by p at test time. Alternatively, the scaling can be moved into the training phase by dividing the mask by p.
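
A minimal sketch of this mismatch and of the test-time fix. The activations are set to all ones purely so the averages are easy to read; that, the array size and the seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                  # probability of keeping a neuron
h1 = np.ones(100_000)    # made-up activations, all equal to 1

u1 = rng.binomial(1, p, size=h1.shape)
train_out = h1 * u1      # training: apply the mask, no scaling
test_out = h1 * p        # test time: no mask, scale the output by p

print(train_out.mean())  # close to p
print(test_out.mean())   # exactly p
```

Without the test-time scaling, the next layer would receive inputs about `1/p` times larger than those it saw during training.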

#### Dropout training, notice the scaling of 1/p
```python
import numpy as np
# Dividing the mask by p rescales the surviving activations by 1/p
u1 = np.random.binomial(1, p, size=h1.shape) / p
h1 *= u1
```

With that code, we make the expectation of the layer output x instead of `p*x`, because we scale it back up by 1/p. Hence, at test time we don’t need to do anything, as the expected output of the layer is already the same.
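
Putting the two phases together, the scaled version could be wrapped in a single function. This is a sketch only: the name `dropout_forward` and its `train` flag are hypothetical, not taken from any library.

```python
import numpy as np

def dropout_forward(h, p, train, rng):
    """Dropout with 1/p scaling; p is the probability of keeping a neuron.

    Hypothetical helper for illustration.
    """
    if not train:
        return h  # test time: nothing to do
    mask = rng.binomial(1, p, size=h.shape) / p  # entries are 0 or 1/p
    return h * mask

rng = np.random.default_rng(0)
h1 = np.ones(100_000)  # made-up activations, all equal to 1

train_out = dropout_forward(h1, 0.5, train=True, rng=rng)
test_out = dropout_forward(h1, 0.5, train=False, rng=rng)

print(train_out.mean())  # close to 1, the same expectation as at test time
print(test_out.mean())   # exactly 1
```

With `p = 0.5`, each surviving activation is doubled during training, so the layer's average output matches what it produces at test time.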