# Explanation

Weight decay is one of the simplest early regularization strategies, which makes it a good starting point for understanding regularization more broadly.

### Intuition

Standard feed-forward networks (without modifications) are prone to overfitting once they become large enough to fit the training set exactly, including correlations that are only there by chance.

This is problematic since we want the model to learn useful representations that reflect the true distribution, without learning the effects of noise. **Regularization** is the effort to make models _generalize_ better so they no longer learn this noise.

Empirically, models with larger weights appear to be significantly more prone to overfitting.

Why is this the case?

Larger weights make the model more sensitive to small changes in its inputs: each layer multiplies those changes by larger coefficients, so they can be magnified through the network, sometimes exponentially in the number of layers.

Sensitivity to small changes also means the model will respond more to _noise_ in the dataset, which is exactly what we want to avoid to make our model generalize well.
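To see the amplification concretely, here is a minimal sketch that assumes a stack of purely linear layers with random weights (no activation functions, which a real network would have). The function name `output_shift` and all of its parameters are illustrative and not taken from any library. It measures how far the output moves when the input is nudged slightly, for two overall weight scales.

```python
import numpy as np

def output_shift(weight_scale, depth=5, width=16, eps=1e-3, seed=0):
    """Distance the output moves when the input is perturbed by a small vector.

    Assumption: the "network" is just `depth` random linear layers, so the
    effect of repeated multiplication is isolated from everything else.
    """
    rng = np.random.default_rng(seed)
    # Same random directions for every scale; only the magnitude differs.
    layers = [
        weight_scale * rng.standard_normal((width, width)) / np.sqrt(width)
        for _ in range(depth)
    ]
    x = rng.standard_normal(width)
    delta = eps * rng.standard_normal(width)  # small input perturbation

    def forward(v):
        for W in layers:
            v = W @ v
        return v

    return np.linalg.norm(forward(x + delta) - forward(x))

small = output_shift(weight_scale=1.0)
large = output_shift(weight_scale=2.0)
# Doubling every weight multiplies the shift by 2**depth in this linear stack.
print(large / small)
```

In this simplified setting, doubling the weights in a 5-layer stack magnifies the same input perturbation by a factor of about 32 (2 to the power of the depth), which is the exponential growth described above.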

**We want to add some incentive for our model to keep its weights as small as possible.**

There will be some optimal tradeoff between keeping weights small enough to not magnify noise, while making them large enough to actually distinguish useful features.

### Math

We can easily add this incentive to keep weights as small as possible (but still large enough to be useful) by modifying the loss function.

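A standard cost function for a feed-forward network is the following sum-of-squares error over the $p$ training examples:

$$E_0(w) = \frac{1}{2} \sum_{\mu = 1}^{p} [f_u(\mathcal{E}^{\mu}) - f_w(\mathcal{E}^\mu)]^2$$

Here $f_w$ is the network with weights $w$, $f_u$ is the target function that produces the real values, and $\mathcal{E}^\mu$ is the $\mu$-th training input.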

Clearly, this cost function incentivizes the network to directly minimize the error between predictions and real values, but does nothing to incentivize low weights.

We can add this incentive by simply adding another term to the cost function that grows with the weights.

$$E(w) = E_0(w) + \frac{1}{2} \lambda \sum_i w_i^2$$

Adding the sum of the squared weights (the squared L2 norm of the weight vector) to the loss function means the loss now grows with the weights, so the network is incentivized to keep them as small as possible.

This method is known as **L2 regularization**.
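It is also worth spelling out why the same penalty is called weight decay. Assuming plain gradient descent with a learning rate $\eta$ (a symbol introduced here for illustration), differentiating $E(w)$ with respect to a single weight gives:

$$\frac{\partial E}{\partial w_i} = \frac{\partial E_0}{\partial w_i} + \lambda w_i$$

$$w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i} = (1 - \eta \lambda) w_i - \eta \frac{\partial E_0}{\partial w_i}$$

So under this assumption, each update first shrinks every weight by the factor $(1 - \eta \lambda)$ and then applies the usual error-driven step. A weight that the error term does not push on simply decays toward zero.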

In order to make sure that the network doesn't over-prioritize small weights (for example, by minimizing the loss purely through shrinking its weights without ever fitting the dataset), we scale the penalty with a hyperparameter $\lambda$, which sets how much the penalty counts relative to the error term.
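The regularized cost can be sketched for the simplest case, a linear model $f_w(x) = w \cdot x$ trained with plain gradient descent. The linear model, the helper names (`l2_regularized_loss`, `l2_regularized_grad`, `fit`), the learning rate, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def l2_regularized_loss(w, X, y, lam):
    """E(w) = 1/2 * sum((y - X w)^2) + 1/2 * lam * sum(w^2)."""
    residual = y - X @ w
    return 0.5 * np.sum(residual ** 2) + 0.5 * lam * np.sum(w ** 2)

def l2_regularized_grad(w, X, y, lam):
    """Gradient of the loss above: the error term plus lam * w."""
    return -X.T @ (y - X @ w) + lam * w

def fit(X, y, lam, lr=0.005, steps=2000):
    """Plain gradient descent on the regularized loss, starting from zero."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * l2_regularized_grad(w, X, y, lam)
    return w

# Synthetic data (hypothetical): 50 examples, 5 features, slightly noisy targets.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(50)

w_plain = fit(X, y, lam=0.0)
w_decay = fit(X, y, lam=10.0)
print(np.linalg.norm(w_plain), np.linalg.norm(w_decay))
```

With `lam=0.0` this reduces to ordinary least squares. Raising `lam` trades a higher training error for a weight vector with a smaller norm, and the right value has to be chosen by validation rather than read off from the formula.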



# My Notes

## 📜 [A Simple Weight Decay Can Improve Generalization](https://proceedings.neurips.cc/paper/1991/file/8eefcfdf5990e441f0fb6f3fad709e21-Paper.pdf)

> It is proven that a weight decay has two effects in a linear network. First, it suppresses any irrelevant components of the weight vector by choosing the smallest vector that solves the learning problem. Second, if the size is chosen right, a weight decay can suppress some of the effects of static noise on the targets, which improves generalization quite a lot.

> Bad generalization occurs if the information [in the training set] does not match the complexity [of the network].

> A different way to constrain a network and thus decrease its complexity, is to limit the growth of the weights through some kind of weight decay.

> [This] can be realized by adding a term to the cost function that penalizes large weights.

$$E(w) = E_0(w) + \frac{1}{2} \lambda \sum_i w_i^2$$

> The aim of the learning is not only to learn the examples, but to learn the underlying function that produces the targets for the learning process.

> A small weight decay will pick out the point in the valley with the smallest norm among all the points in the valley.

> In general it can not be proven that picking that solution is the best strategy. But, at least from a philosophical point of view, it seems sensible, because it is (in a loose sense) the solution with the smallest complexity - the one that Ockham would probably have chosen.

Applying Ockham’s razor.
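A rough numerical illustration of the "smallest norm in the valley" claim, assuming a linear model with more weights than training examples so that many weight vectors fit the data exactly. The data, the value of `lam`, and the names `ridge`, `w_decay`, `w_min_norm`, and `w_other` are made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 6))  # 3 examples, 6 weights: underdetermined
y = rng.standard_normal(3)

def ridge(X, y, lam):
    """Minimizer of 1/2 * ||y - X w||^2 + 1/2 * lam * ||w||^2."""
    n_weights = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_weights), X.T @ y)

w_decay = ridge(X, y, lam=1e-6)        # very small weight decay
w_min_norm = np.linalg.pinv(X) @ y     # minimum-norm exact solution

# Another point in the valley: add a direction that X maps to zero.
v = rng.standard_normal(6)
null_dir = v - np.linalg.pinv(X) @ (X @ v)
w_other = w_min_norm + null_dir

print(np.linalg.norm(w_decay - w_min_norm))        # close to zero
print(np.linalg.norm(X @ w_other - y))             # w_other also fits the data
print(np.linalg.norm(w_min_norm), np.linalg.norm(w_other))
```

Both `w_min_norm` and `w_other` reproduce the targets, but the solution found with a tiny weight decay lands next to the one with the smaller norm.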

> The value of a weight decay is more evident if there are small errors in the targets.

Smaller weights mean less ability to fit noise.