# Regularization for Deep Learning

A central problem in machine learning is how to make an algorithm that will perform well not just on the training data, but also on new inputs. Many strategies used in machine learning are explicitly designed to reduce the test error, possibly at the expense of increased training error. These strategies are known collectively as regularization.

**Definition**: We defined regularization as “any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.”

In the context of deep learning, most regularization strategies are based on regularizing estimators. Regularization of an estimator works by trading increased bias for reduced variance. An effective regularizer is one that makes a profitable trade, reducing variance significantly while not overly increasing the bias.

Deep learning algorithms are typically applied to extremely complicated domains such as images, audio sequences and text, for which the true generation process essentially involves simulating the entire universe. To some extent, we are always trying to fit a square peg (the data-generating process) into a round hole (our model family).

We might find, and indeed in practical deep learning scenarios we almost always do find, that the best fitting model (in the sense of minimizing generalization error) is a large model that has been regularized appropriately.

## Parameter Norm Penalties

Many regularization approaches are based on limiting the capacity of models, such as neural networks, linear regression, or logistic regression, by adding a parameter norm penalty $\Omega(\theta)$ to the objective function $J$. We denote the regularized objective function by $\tilde{J}$:

$$
\tilde{J}(\theta;\textbf{X},y) = J(\theta;\textbf{X},y) + \alpha\Omega(\theta)
$$

where $\alpha\in[0,\infty)$ is a hyperparameter that weights the contribution of the norm penalty term, $\Omega$, relative to the standard objective function $J$. Setting $\alpha = 0$ results in no regularization. Larger values of $\alpha$ correspond to more regularization. When the training algorithm minimizes the regularized objective function $\tilde{J}$, it decreases both the original objective $J$ on the training data and some measure of the size of the parameters $\theta$ (or some subset of the parameters).

**Note:** For neural networks, we typically choose a parameter norm penalty $\Omega$ that penalizes only the weights of the affine transformation at each layer and leaves the biases unregularized. Each bias controls only a single variable, so the biases typically require less data than the weights to fit accurately. This means that we do not induce too much variance by leaving the biases unregularized.
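As a minimal sketch of this convention, the snippet below computes $\tilde{J} = J + \alpha\Omega(\theta)$ with the penalty applied to the weight matrices only. The parameter dictionary, the value of the unregularized loss, and the choice of a squared Euclidean norm for $\Omega$ are all assumptions made for illustration.

```python
import numpy as np

def squared_norm_penalty(weights):
    """One possible Omega: half the sum of squared entries over all weight matrices."""
    return 0.5 * sum(float(np.sum(W ** 2)) for W in weights)

def regularized_objective(data_loss, weights, alpha):
    """J_tilde = J + alpha * Omega; biases are deliberately not passed in."""
    return data_loss + alpha * squared_norm_penalty(weights)

# Hypothetical two-layer network parameters (illustrative values only).
params = {
    "W1": np.array([[1.0, -2.0], [0.5, 0.0]]),
    "b1": np.array([10.0, -10.0]),
    "W2": np.array([[3.0], [1.0]]),
    "b2": np.array([5.0]),
}
weights = [params["W1"], params["W2"]]  # b1 and b2 are left unregularized

j_unregularized = 2.0  # assumed value of J on the training data
j_tilde = regularized_objective(j_unregularized, weights, alpha=0.1)
j_no_reg = regularized_objective(j_unregularized, weights, alpha=0.0)
```

With $\alpha = 0$ the regularized objective reduces to the original one, and the large bias values never enter the penalty.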

## $L^2$ Parameter Regularization

We begin with one of the simplest and most common kinds of parameter norm penalty: the $L^2$ parameter norm penalty, commonly known as **weight decay**. This regularization strategy drives the weights closer to the origin by adding a regularization term $\Omega(\theta) = \frac{1}{2}\Vert w \Vert_2^2$ to the objective function. $L^2$ regularization is also known as **ridge regression**.

We can gain some insight into the behavior of weight decay regularization by studying the gradient of the regularized objective function. To simplify the presentation, we assume no bias parameter, so $\theta$ is just $w$. Such a model has the following total objective function:

$$
\tilde{J}(w;\textbf{X}, y) = \frac{\alpha}{2}w^Tw + J(w;\textbf{X}, y)
$$

with corresponding parameter gradient

$$
\nabla_w\tilde{J}(w;\textbf{X}, y) = \alpha w + \nabla_w J(w;\textbf{X}, y)
$$

To take a single gradient step to update the weights, we perform this update:

$$
w \leftarrow w - \epsilon(\alpha w + \nabla_w J(w;\textbf{X}, y))
$$
$$
\iff
$$
$$
w \leftarrow (1-\epsilon\alpha)w -\epsilon\nabla_w J(w;\textbf{X}, y)
$$

We can see that the addition of the weight decay term has modified the learning rule to multiplicatively shrink the weight vector by a constant factor on each step, just before performing the usual gradient update. This describes what happens in a single step. But what happens over the entire course of training?
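The equivalence of the two forms of the update can be checked numerically. The sketch below assumes a toy objective $J(w) = \frac{1}{2}\Vert w - t \Vert^2$ with a hand-picked target $t$; the function names and values are illustrative only.

```python
import numpy as np

def grad_J(w):
    """Gradient of an assumed toy objective J(w) = 1/2 * ||w - target||^2."""
    target = np.array([1.0, -2.0])
    return w - target

def step_penalized_gradient(w, alpha, eps):
    """w <- w - eps * (alpha * w + grad J(w))."""
    return w - eps * (alpha * w + grad_J(w))

def step_weight_decay(w, alpha, eps):
    """w <- (1 - eps * alpha) * w - eps * grad J(w): shrink first, then step."""
    return (1.0 - eps * alpha) * w - eps * grad_J(w)

w = np.array([3.0, 4.0])
alpha, eps = 0.5, 0.1
w_a = step_penalized_gradient(w, alpha, eps)
w_b = step_weight_decay(w, alpha, eps)
```

Both functions return the same vector, here approximately `[2.65, 3.2]`, which is the weight vector shrunk by the factor $1 - \epsilon\alpha = 0.95$ and then moved by the usual gradient step.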

We will further simplify the analysis by making a quadratic approximation to the objective function in the neighborhood of the value of the weights that obtains minimal unregularized training cost, $w^* = \arg\min_w J(w)$. If the objective function is truly quadratic, as in the case of fitting a linear regression model with mean squared error, then the approximation is perfect. The approximation $\hat{J}$ is given by

$$
\hat{J}(w) = J(w^*) + \frac{1}{2}(w-w^*)^T\textbf{H}(w-w^*)
$$

where $\textbf{H}$ is the Hessian matrix of $J$ with respect to $w$ evaluated at $w^*$. There is no first-order term in this quadratic approximation, because $w^*$ is defined to be a minimum, where the gradient vanishes. Likewise, because $w^*$ is the location of a minimum of $J$, we can conclude that $\textbf{H}$ is positive semidefinite. The minimum of $\hat{J}$ occurs where its gradient

$$
\nabla_w\hat{J}(w) = \textbf{H} (w-w^*)
$$

is equal to $\textbf{0}$.
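For a linear model trained with mean squared error, the claim that the quadratic approximation is exact can be verified directly. The following sketch uses synthetic data and assumed helper names (`J`, `J_hat`); it is an illustration rather than part of the original derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # synthetic inputs
y = rng.normal(size=20)        # synthetic targets

def J(w):
    """Mean squared error of a linear model (assumed example objective)."""
    r = X @ w - y
    return r @ r / len(y)

H = 2.0 * X.T @ X / len(y)                   # Hessian of J, constant in w
w_star = np.linalg.solve(X.T @ X, X.T @ y)   # unregularized minimizer of J

def J_hat(w):
    """Quadratic approximation J(w*) + 1/2 (w - w*)^T H (w - w*)."""
    d = w - w_star
    return J(w_star) + 0.5 * d @ H @ d

w_test = rng.normal(size=3)
gap = abs(J(w_test) - J_hat(w_test))   # zero up to floating-point error
```

The eigenvalues of `H` are nonnegative, consistent with the positive semidefinite condition, and the gradient $\textbf{H}(w - w^*)$ vanishes at `w_star`.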

To study the effect of weight decay, we modify the equation above by adding the weight decay gradient. We can now solve for the minimum of the regularized version of $\hat{J}$. We use the variable $\tilde{w}$ to represent the location of the minimum.

$$
\alpha\tilde{w} + \textbf{H}(\tilde{w}- w^*)=0
$$
$$
(\textbf{H}+\alpha\textbf{I})\tilde{w} = \textbf{H}w^*
$$
$$
\tilde{w} = (\textbf{H}+\alpha\textbf{I})^{-1}\textbf{H}w^*
$$

As $\alpha$ approaches 0, the regularized solution $\tilde{w}$ approaches $w^*$. But what happens as $\alpha$ grows? Because $\textbf{H}$ is real and symmetric, we can decompose it into a diagonal matrix $\Lambda$ and an orthonormal basis of eigenvectors, $Q$, such that $\textbf{H} = Q\Lambda Q^T$. So we obtain

$$
\tilde{w} = (Q\Lambda Q^T+\alpha\textbf{I})^{-1}Q\Lambda Q^Tw^*
$$
$$
\tilde{w} = (Q(\Lambda +\alpha\textbf{I})Q^T)^{-1}Q\Lambda Q^Tw^*
$$
$$
\tilde{w} = Q(\Lambda +\alpha\textbf{I})^{-1}\Lambda Q^Tw^*
$$

We see that the effect of weight decay is to rescale $w^*$ along the axes defined by the eigenvectors of $\textbf{H}$. Specifically, the component of $w^*$ that is aligned with the $i$-th eigenvector of $\textbf{H}$ is rescaled by a factor of $\frac{\lambda_i}{\lambda_i+\alpha}$.

Along the directions where the eigenvalues of $\textbf{H}$ are relatively large, for example, where $\lambda_i \gg \alpha$, the effect of regularization is relatively small. Yet components with $\lambda_i \ll \alpha$ will be shrunk to have nearly zero magnitude.

<img src=img/download.png>

An illustration of the effect of $L^2$ (or weight decay) regularization on the value of the optimal $w$. The solid ellipses represent contours of equal value of the unregularized objective. The dotted circles represent contours of equal value of the $L^2$ regularizer. At the point $\tilde{w}$, these competing objectives reach an equilibrium. In the first dimension, the eigenvalue of the Hessian of $J$ is small. The objective function does not increase much when moving horizontally away from $w^*$. Because the objective function does not express a strong preference along this direction, the regularizer has a strong effect on this axis. The regularizer pulls $w_1$ close to zero. In the second dimension, the objective function is very sensitive to movements away from $w^*$. The corresponding eigenvalue is large, indicating high curvature. As a result, weight decay affects the position of $w_2$ relatively little.
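The eigenvector rescaling can be checked numerically. The sketch below assumes a small hand-picked symmetric positive definite Hessian and unregularized optimum, then compares the direct solution of the regularized problem with the per-eigendirection rescaling by $\frac{\lambda_i}{\lambda_i+\alpha}$.

```python
import numpy as np

# Assumed toy Hessian (symmetric positive definite) and unregularized optimum.
H = np.array([[4.0, 1.0], [1.0, 0.5]])
w_star = np.array([1.0, 2.0])
alpha = 1.0

# Direct solution of (H + alpha I) w_tilde = H w*.
w_direct = np.linalg.solve(H + alpha * np.eye(2), H @ w_star)

# Eigenbasis solution: rescale each component of w* by lambda_i / (lambda_i + alpha).
lam, Q = np.linalg.eigh(H)        # H = Q diag(lam) Q^T, eigenvalues in ascending order
scale = lam / (lam + alpha)
w_eigen = Q @ (scale * (Q.T @ w_star))
```

Both routes give the same $\tilde{w}$. Because `eigh` returns eigenvalues in ascending order, `scale[0]` belongs to the low-curvature direction and is the smaller factor, so that direction is shrunk the most.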

### Example with actual quadratic cost function

For linear regression, the cost function is the sum of squared errors

$$
(Xw - y)^T(Xw-y)
$$

When we add $L^2$ regularization, the objective function changes to

$$
(Xw - y)^T(Xw-y) + \frac{1}{2}\alpha w^Tw
$$

Absorbing constant factors into $\alpha$, this changes the normal equations for the solution from

$$
w = (X^TX)^{-1}X^Ty
$$

to

$$
w = (X^TX + \alpha I)^{-1}X^Ty
$$

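The matrix $X^TX$ is proportional to the covariance matrix $\frac{1}{m}X^TX$, where $m$ is the number of examples. Using $L^2$ regularization replaces this matrix with $X^TX + \alpha I$, which is the same as the original one but with $\alpha$ added to the diagonal. The diagonal entries of this matrix correspond to the variance of each input feature. We can see that $L^2$ regularization causes the learning algorithm to “perceive” the input $X$ as having higher variance, which makes it shrink the weights on features whose covariance with the output target is low compared to this added variance.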
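A minimal sketch of the regularized normal equations on synthetic data follows. The data, the helper name `ridge_solution`, and the value of $\alpha$ are assumptions; the scaling convention assumed here is the objective $\frac{1}{2}\Vert Xw - y\Vert^2 + \frac{\alpha}{2}\Vert w\Vert^2$, whose stationary point is exactly $(X^TX + \alpha I)^{-1}X^Ty$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # synthetic design matrix
w_true = np.array([2.0, -1.0, 0.5])          # assumed ground-truth weights
y = X @ w_true + 0.1 * rng.normal(size=50)   # noisy targets

def ridge_solution(X, y, alpha):
    """Solve (X^T X + alpha I) w = X^T y without forming an explicit inverse."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alpha = 10.0
w_ols = ridge_solution(X, y, alpha=0.0)      # ordinary least squares
w_ridge = ridge_solution(X, y, alpha=alpha)  # L2-regularized solution

# Gradient of 1/2 ||Xw - y||^2 + alpha/2 ||w||^2, evaluated at the ridge solution.
grad_at_ridge = X.T @ (X @ w_ridge - y) + alpha * w_ridge
```

The gradient at `w_ridge` is zero up to floating-point error, and the norm of `w_ridge` is smaller than that of `w_ols`, reflecting the shrinkage toward the origin.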

## $L^1$ Regularization

Formally, $L^1$ regularization on the model parameter $w$ is defined as

$$
\Omega(\theta) = \Vert w \Vert_1 = \sum_i |w_i|
$$

We are interested in delineating the differences between $L^1$ and $L^2$ forms of regularization. As with $L^2$ weight decay, $L^1$ weight decay controls the strength of the regularization by scaling the penalty $\Omega$ using a positive hyperparameter $\alpha$.

Thus, the regularized objective function $\tilde{J}(w; X, y)$ is given by
$$
\tilde{J}(w;X,y) = \alpha\Vert w \Vert_1 + J(w;X,y)
$$

with corresponding gradient (actually, subgradient)

$$
\nabla_w\tilde{J}(w;X,y) = \alpha \operatorname{sign}(w)+ \nabla_w J(w;X,y)
$$

where $\operatorname{sign}(w)$ is simply the sign of $w$ applied element-wise. We can see that the regularization contribution to the gradient no longer scales linearly with each $w_i$; instead it is a constant factor with a sign equal to $\operatorname{sign}(w_i)$. One consequence of this form of the gradient is that we will not necessarily see clean algebraic solutions to quadratic approximations of $J(w;X,y)$ as we did for $L^2$ regularization.
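The difference between the two penalty gradients is easy to see on a few sample weights. In this sketch the function names and the sample values are hypothetical; `np.sign` returns 0 at $w_i = 0$, which is one valid choice of subgradient there.

```python
import numpy as np

def l1_penalty_subgradient(w, alpha):
    """alpha * sign(w); np.sign gives 0 at w_i = 0, one valid subgradient choice."""
    return alpha * np.sign(w)

def l2_penalty_gradient(w, alpha):
    """alpha * w, the gradient of (alpha / 2) * w^T w."""
    return alpha * w

w = np.array([-3.0, -0.01, 0.0, 0.01, 3.0])
alpha = 0.5
g1 = l1_penalty_subgradient(w, alpha)   # magnitude alpha wherever w_i != 0
g2 = l2_penalty_gradient(w, alpha)      # magnitude proportional to |w_i|
```

The $L^1$ term pushes a weight of 0.01 toward zero just as hard as a weight of 3, whereas the $L^2$ push on the small weight is nearly negligible.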

Because the $L^1$ penalty does not admit clean algebraic expressions in the case of a fully general Hessian, we will also make the further simplifying assumption that the Hessian is diagonal, $\textbf{H} = \operatorname{diag}([H_{11}, \dots, H_{nn}])$, where each $H_{ii}>0$.

This assumption holds if the data for the linear regression problem has been preprocessed to remove all correlation between the input features, which may be accomplished using PCA.

Our quadratic approximation of the $L^1$ regularized objective function decomposes into a sum over the parameters:

$$
\hat{J}(w;X,y) = J(w^*;X,y) + \sum_i\left[\frac{1}{2} H_{ii}(w_i-w_i^*)^2+\alpha|w_i|\right]
$$

The problem of minimizing this approximate cost function has an analytical solution (for each dimension $i$), with the following form:

$$
w_i = \operatorname{sign}(w_i^*)\max \left\{|w_i^*| - \frac{\alpha}{H_{ii}}, 0\right\}
$$

Consider the situation where $w_i^* > 0$ for all i. There are two possible outcomes:

1. The case where $w_i^* \leq \frac{\alpha}{H_{ii}}$. Here the optimal value of $w_i$ under the regularized objective is simply $w_i= 0$. This occurs because the contribution of $J(w;X, y)$ to the regularized objective $\tilde{J}(w;X, y)$ is overwhelmed—in direction i—by the $L^1$ regularization, which pushes the value of $w_i$ to zero.

2. The case where $w^*_i > \frac{\alpha}{H_{ii}}$. In this case, the regularization does not move the optimal value of $w_i$ to zero but instead just shifts it in that direction by a distance equal to $\frac{\alpha}{H_{ii}}$.

A similar process happens when $w^*_i < 0$, but with the $L^1$ penalty making $w_i$ less negative by $\frac{\alpha}{H_{ii}}$, or 0. In comparison to $L^2$ regularization, $L^1$ regularization results in a solution that is more **sparse**. Sparsity in this context refers to the fact that some parameters have an optimal value of zero.
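The per-coordinate solution and its two cases can be sketched as follows. The function name `l1_minimizer` and the example values of $w^*$, $H_{ii}$ and $\alpha$ are assumptions chosen so that both cases appear for positive and negative $w^*_i$.

```python
import numpy as np

def l1_minimizer(w_star, H_diag, alpha):
    """Per-coordinate minimizer: sign(w*_i) * max(|w*_i| - alpha / H_ii, 0)."""
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / H_diag, 0.0)

w_star = np.array([0.3, 2.0, -0.3, -2.0])   # assumed unregularized optimum
H_diag = np.array([1.0, 1.0, 2.0, 2.0])     # assumed diagonal Hessian entries, all > 0
alpha = 0.5

# Thresholds alpha / H_ii are [0.5, 0.5, 0.25, 0.25].
w_l1 = l1_minimizer(w_star, H_diag, alpha)
```

Here the first coordinate falls below its threshold and is set exactly to zero, while the others are shifted toward zero by $\frac{\alpha}{H_{ii}}$, giving approximately `[0, 1.5, -0.05, -1.75]`. Note that the third coordinate survives only because its larger $H_{ii}$ lowers the threshold.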

The sparsity of $L^1$ regularization is qualitatively different from the behavior that arises with $L^2$ regularization. If we revisit the $L^2$ solution $\tilde{w}$ using the assumption of a diagonal and positive definite Hessian $\textbf{H}$ that we introduced for our analysis of $L^1$ regularization, we find that
$\tilde{w}_i = \frac{H_{ii}}{H_{ii} + \alpha} w^*_i$.
If $w_i^*$ was nonzero, then $\tilde{w}_i$ remains nonzero. This demonstrates that $L^2$ regularization does not cause the parameters to become sparse, while $L^1$ regularization may do so for large enough $\alpha$.
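To illustrate the $L^2$ side of this comparison, the sketch below applies the diagonal-Hessian shrinkage factor for increasingly large $\alpha$. The helper name `l2_minimizer` and the example values are assumptions.

```python
import numpy as np

def l2_minimizer(w_star, H_diag, alpha):
    """Per-coordinate L2 solution: H_ii / (H_ii + alpha) * w*_i."""
    return H_diag / (H_diag + alpha) * w_star

w_star = np.array([0.3, 2.0, -0.3, -2.0])   # assumed unregularized optimum
H_diag = np.array([1.0, 1.0, 2.0, 2.0])     # assumed diagonal Hessian entries, all > 0

alphas = (0.5, 10.0, 1000.0)
solutions = {a: l2_minimizer(w_star, H_diag, a) for a in alphas}
zero_counts = {a: int(np.sum(solutions[a] == 0.0)) for a in alphas}
```

Every entry shrinks as $\alpha$ grows, but `zero_counts` stays at 0 for each $\alpha$: the weights get arbitrarily small without ever becoming exactly zero.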

The sparsity property induced by $L^1$ regularization has been used extensively as a **feature selection** mechanism. Feature selection simplifies a machine learning problem by choosing which subset of the available features should be used. In particular, the well known LASSO (Tibshirani, 1995) (least absolute shrinkage and selection operator) model integrates an $L^1$ penalty with a linear model and a least-squares cost function. The $L^1$ penalty causes a subset of the weights to become zero, suggesting that the corresponding features may safely be discarded.

We saw that many regularization strategies can be interpreted as MAP Bayesian inference, and that in particular, $L^2$ regularization is equivalent to MAP Bayesian inference with a Gaussian prior on the weights. For $L^1$ regularization, the penalty $\alpha \Omega(w) = \alpha\sum_i|w_i|$ used to regularize a cost function is equivalent to the log-prior term that is maximized by MAP Bayesian inference when the prior is an isotropic Laplace distribution over $w \in \mathbb{R}^n$.

$$
\log p(w) = \sum_i \log \operatorname{Laplace}\left(w_i;0, \frac{1}{\alpha}\right) = 
-\alpha\Vert w\Vert_1 + n\log\alpha - n\log 2
$$

From the point of view of learning via maximization with respect to $w$, we can ignore the $n\log \alpha - n\log 2$ terms because they do not depend on $w$.
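The identity can be confirmed numerically. This sketch assumes the standard Laplace density $\operatorname{Laplace}(x;\mu,\gamma) = \frac{1}{2\gamma}\exp\left(-\frac{|x-\mu|}{\gamma}\right)$ and uses arbitrary example weights.

```python
import numpy as np

def laplace_log_density(x, mu, gamma):
    """log of Laplace(x; mu, gamma) = 1 / (2 gamma) * exp(-|x - mu| / gamma)."""
    return -np.log(2.0 * gamma) - np.abs(x - mu) / gamma

w = np.array([0.5, -1.5, 0.0, 2.0])   # assumed example weights
alpha = 0.7
n = w.size

# Sum of per-weight log densities with location 0 and scale 1 / alpha.
log_prior = np.sum(laplace_log_density(w, mu=0.0, gamma=1.0 / alpha))

# Closed form: -alpha * ||w||_1 + n log(alpha) - n log(2).
closed_form = -alpha * np.sum(np.abs(w)) + n * np.log(alpha) - n * np.log(2.0)
```

The two quantities agree, and only the $-\alpha\Vert w\Vert_1$ part changes when $w$ changes, which is why maximizing the log-prior corresponds to minimizing the $L^1$ penalty.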