# Understanding L2 regularization, Weight decay and AdamW
> A post explaining L2 regularization, weight decay, and the AdamW optimizer as described in the paper Decoupled Weight Decay Regularization. We will also go over how to implement these using TensorFlow 2.x.

- toc: false
- badges: true
- comments: true
- categories: [machinelearning deeplearning python3.x tensorflow2.x]

## What is regularization ?

In simple words, regularization helps reduce over-fitting to the training data. There are many regularization strategies.

The major regularization techniques used in practice are:
* L2 Regularization
* L1 Regularization
* Data Augmentation
* Dropout
* Early Stopping

## L2 regularization :

In L2 regularization, an extra term, often referred to as the regularization term, is added to the loss function of the network.
 
Consider the following cross-entropy loss function (without regularization):

$$loss= -\frac{1}{m} \sum\limits_{i = 1}^{m} (y^{(i)}\log\left(yhat^{(i)}\right) + (1-y^{(i)})\log\left(1-yhat^{(i)}\right)) $$

To apply L2 regularization to the loss function above we add the term given below to the loss function :

$$\frac{\lambda}{2m}\sum\limits_{w}w^{2} $$

where $\lambda$ is the regularization parameter and $m$ is the number of training examples. $\lambda$ is a hyperparameter, which means it is not learned during training but is tuned manually by the user.

After applying the `regularization term` to our original loss function :
$$finalLoss= -\frac{1}{m} \sum\limits_{i = 1}^{m} (y^{(i)}\log\left(yhat^{(i)}\right) + (1-y^{(i)})\log\left(1-yhat^{(i)}\right)) + \frac{\lambda}{2m}\sum\limits_{w}w^{2}$$

or , 
$$ finalLoss = loss+ \frac{\lambda}{2m}\sum\limits_{w}w^{2}$$


or in simple code :


```python
# l2_reg_term is half the sum of the squared weights
# (the constant 1/m from the formula is absorbed into lamdba here)
l2_reg_term = np.sum(np.square(weights)) / 2
final_loss = loss_fn(y, y_hat) + lamdba * l2_reg_term
```

> Note: all code equations are written in Python using NumPy notation.
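
To make the formula concrete, here is a minimal NumPy sketch of the regularized loss that keeps the $\frac{1}{m}$ factor from the equation. The helper names (`binary_cross_entropy`, `l2_regularized_loss`), the `eps` clipping constant, and the toy data are assumptions made for illustration; `lam` plays the role of $\lambda$ (`lamdba` in the snippet above).

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Mean binary cross-entropy; eps guards against log(0)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

def l2_regularized_loss(y, y_hat, weights, lam):
    """Cross-entropy plus (lam / 2m) * sum of squared weights."""
    m = y.shape[0]
    # Sum the squared entries over every weight array in the model
    l2_penalty = lam / (2.0 * m) * sum(np.sum(np.square(w)) for w in weights)
    return binary_cross_entropy(y, y_hat) + l2_penalty

# Toy labels, predictions and weights (made up for illustration)
y = np.array([1.0, 0.0, 1.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.7, 0.6])
weights = [np.array([[0.5, -1.0], [2.0, 0.0]]), np.array([1.0, -0.5])]

plain = binary_cross_entropy(y, y_hat)
regularized = l2_regularized_loss(y, y_hat, weights, lam=0.1)
print(plain, regularized)
```

The gap between the two printed values is exactly the penalty term, so larger weights or a larger `lam` push the regularized loss further above the plain one.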

Consequently, the weight update step for **vanilla SGD** is going to look something like this:
```python
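# General form: step along the loss gradient plus the gradient of the L2 term
w = w - learning_rate * grad_w - learning_rate * lamdba * grad(l2_reg_term, w)
# Since grad(l2_reg_term, w) is simply w, the same update can be written as:
w = w - learning_rate * grad_w - learning_rate * lamdba * w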
```

> Note: assume that grad_w is the gradient of the model's loss with respect to the model's weights.

> Note: assume that grad(a, b) calculates the gradient of a with respect to b.
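
The simplification used in the update step relies on the gradient of the L2 term being `w` itself. The sketch below checks this numerically with a central finite difference; `numerical_grad` is a hypothetical stand-in for `grad`, and the weights, the loss gradient, and the hyperparameter values are made up for illustration.

```python
import numpy as np

def l2_reg_term(w):
    """Half the sum of squared weights."""
    return np.sum(np.square(w)) / 2

def numerical_grad(f, w, h=1e-6):
    """Central finite-difference estimate of the gradient of f at w."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step.flat[i] = h
        grad.flat[i] = (f(w + step) - f(w - step)) / (2 * h)
    return grad

w = np.array([0.5, -1.5, 2.0])
grad_w = np.array([0.1, -0.2, 0.3])  # made-up gradient of the data loss
learning_rate, lam = 0.1, 0.01

# General form of the update, using the estimated gradient of the L2 term
general = w - learning_rate * grad_w - learning_rate * lam * numerical_grad(l2_reg_term, w)
# Simplified form, with grad(l2_reg_term, w) replaced by w
simplified = w - learning_rate * grad_w - learning_rate * lam * w

updates_match = np.allclose(general, simplified)
print(updates_match)
```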

## Weight Decay :

In weight decay we do not modify the loss function. The loss function remains the same; instead, we modify the update step:


The loss remains the same :

```python
final_loss = loss_fn(y, y_hat)
```

During the parameter update:

```python
w = w - learning_rate * grad_w - learning_rate * lamdba * w
```

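> Important: From the above equations, **weight decay** and **L2 regularization** may seem the same, and they are in fact the same for **vanilla SGD**. But as soon as we add momentum, or use a more sophisticated optimizer **like Adam**, **L2 regularization** and **weight decay** become different: the gradient of the L2 term is then transformed by the optimizer along with the loss gradient, whereas weight decay is applied to the weights directly.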
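
The difference can be seen in a small NumPy experiment. This is a simplified sketch, not a reference implementation: it runs a single step with a hand-written Adam update using common default values for `beta1`, `beta2` and `eps`, and every function name and number in it is an assumption made for illustration.

```python
import numpy as np

def sgd_l2(w, grad_w, lr, lam):
    # L2 regularization: the penalty gradient lam * w is added to the loss gradient
    return w - lr * (grad_w + lam * w)

def sgd_weight_decay(w, grad_w, lr, lam):
    # Weight decay: plain gradient step, then shrink the weights directly
    return w - lr * grad_w - lr * lam * w

def adam_direction(g, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Update Adam's moment estimates; return the step direction and new moments."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    return m_hat / (np.sqrt(v_hat) + eps), m, v

def adam_l2(w, grad_w, m, v, t, lr, lam):
    # The penalty gradient passes through Adam's adaptive rescaling
    direction, m, v = adam_direction(grad_w + lam * w, m, v, t)
    return w - lr * direction, m, v

def adam_weight_decay(w, grad_w, m, v, t, lr, lam):
    # The decay term bypasses the adaptive rescaling
    direction, m, v = adam_direction(grad_w, m, v, t)
    return w - lr * direction - lr * lam * w, m, v

w = np.array([1.0, -2.0, 3.0])
grad_w = np.array([0.5, 0.1, -0.3])  # made-up loss gradient
lr, lam = 0.1, 0.01
zeros = np.zeros_like(w)

sgd_same = np.allclose(sgd_l2(w, grad_w, lr, lam), sgd_weight_decay(w, grad_w, lr, lam))
w_l2, _, _ = adam_l2(w, grad_w, zeros, zeros, 1, lr, lam)
w_wd, _, _ = adam_weight_decay(w, grad_w, zeros, zeros, 1, lr, lam)
adam_same = np.allclose(w_l2, w_wd)
print(sgd_same, adam_same)
```

With SGD the two variants produce the same weights, while with the Adam-style step they already differ after one update, because the L2 gradient is divided by the running gradient magnitude and the decay term is not.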